ABSTRACT

We study the complexity of query answering using views in a prob- 
abilistic XML setting, identifying large classes of XPath queries - 
with child and descendant navigation and predicates - for which 
there are efficient (PTime) algorithms. We consider this problem 
under the two possible semantics for XML query results: with per- 
sistent node identifiers and in their absence. Accordingly, we con- 
sider rewritings that can exploit a single view, by means of com- 
pensation, and rewritings that can use multiple views, by means of 
intersection. Since in a probabilistic setting queries return answers 
with probabilities, the problem of rewriting goes beyond the clas- 
sic one of retrieving XML answers from views. For both semantics 
of XML queries, we show that, even when XML answers can be 
retrieved from views, their probabilities may not be computable. 

For rewritings that use only compensation, we describe a PTime 
decision procedure, based on easily verifiable criteria that distin- 
guish between the feasible cases - when probabilistic XML results 
are computable - and the unfeasible ones. For rewritings that can 
use multiple views, with compensation and intersection, we iden- 
tify the most permissive conditions that make probabilistic rewrit- 
ing feasible, and we describe an algorithm that is sound in general, 
and becomes complete under fairly permissive restrictions, running 
in PTime modulo worst-case exponential time equivalence tests. 
This is the best we can hope for since intersection makes query 
equivalence intractable already over deterministic data. Our al- 
gorithm runs in PTime whenever deterministic rewritings can be 
found in PTime. 

1. INTRODUCTION 

Uncertainty is ubiquitous in data and many applications must 
cope with it [21]: information extraction from the World Wide 
Web [13], automatic schema matching in data integration [31], or 
data-collecting sensor networks [30] are inherently imprecise. This 
uncertainty is sometimes represented as the probability that the data 
is correct, as with conditional random fields [24] in information ex- 
traction, or uncertain schema mappings in [19]. In other cases, only 
confidence in the information is provided by the system, and can be 
seen after renormalization as approximation of probabilities. It is 
thus natural to manipulate such probabilistic information in a prob- 
abilistic database management system [15]. 
Recent work has proposed models for probabilistic data, both in
the relational [35, 16, 23] and XML [28, 22, 2] settings. We focus 
here on the latter case, which is particularly adapted for the Web. 
A number of studies on probabilistic XML have dealt with query 
answering for a variety of models and query languages [1, 28, 2, 
22]. At the same time, query optimization over probabilistic data 
has received little attention. In particular, the problem of answering 
queries using views, a key approach for optimization, has received 
no attention so far in both the relational and the semistructured set- 
tings. Yet probabilistic query evaluation could greatly benefit from 
such techniques, as it is often the case that computing probabilistic 
results is harder than in the deterministic setting. 

Views over XML documents can be seen as fragments of data
that may be available for further querying. Over a probabilistic
document, these data fragments come together with their probability.
Given a document d, a set of views $v_1, \dots, v_n$, and a query q, the
goal is to understand whether one can obtain q(d), the answers of
q over d, by accessing only the view results $v_1(d), \dots, v_n(d)$.

For deterministic data, prior research [36, 25] on XPath rewrit- 
ing studied the problem of equivalently rewriting an XPath query 
by navigating inside a single materialized XPath view. This would 
be the only kind of rewriting supported when the query cache can 
store or obtain only copies of the XML elements in a query answer, 
while the original node identities are lost. Following a recent in- 
dustrial trend (supported by systems such as [6]) towards enhanc- 
ing XPath queries with the ability to expose node identifiers and 
exploit them via identity-based equality, techniques for multiple- 
view rewritings built by intersecting several materialized view re- 
sults were proposed. These are potentially more beneficial, as many 
queries with no single-view rewriting can be rewritten using multi- 
ple views. [8] studied the complexity of rewriting XPath using an 
intersection of views and described algorithms that apply for any 
documents and type of identifiers, including application level Ids. 

We study in this paper the complexity of answering queries us- 
ing views in a probabilistic XML setting, identifying large classes 
of XPath queries for which there are efficient (PTime) algorithms. 
Polynomial time techniques for view-based rewriting are in our 
view even more important here than in the deterministic case, given 
that query evaluation over probabilistic XML is intractable (in com- 
bined data and query complexity) [22]. To the best of our knowl- 
edge, our work is the first to address this view-based rewriting prob- 
lem. Since in a probabilistic setting queries return answers with 
probabilities, the problem of rewriting goes beyond the classical 
one of retrieving XML answers from views. 

Contributions. We study the rewriting problem under the two 
possible semantics for probabilistic XML results: with persistent 
node Ids and in their absence. Accordingly, we consider alternative 
plans (rewritings) that can exploit a single view, by means of
compensation, or plans that can use multiple views, with intersection.

We first show that, in the probabilistic setting, the problem of an- 
swering queries using views becomes more complex and it does not 
reduce to its deterministic version. The reason is that query results 
now involve not only data trees, but also their probabilities. Hence 
probabilities should also be retrieved from probabilistic view re- 
sults, by means of a probabilistic function computing them. 

Even for the simpler setting (without persistent Ids), the exis- 
tence of the probabilistic function is not guaranteed by the ex- 
istence of a data-retrieving rewriting. We first present examples 
of views and queries for which such a function does not exist. 
Based on a certain notion of probabilistic independence between 
queries that we introduce (called condition-independence; in short 
c-independence), we identify the tightest class of queries and views 
for which this function exists, and we describe how it can be com- 
puted efficiently. Before describing the general solution, we discuss 
in Section 4.3 a particular case that allows (i) a concise and intuitive 
formulation of the probabilistic function, (ii) an efficient evaluation 
over the view document, with no (or little) post-processing. 

For rewritings with intersection, we first provide a sufficient con- 
dition - also based on c-independence - that guarantees that the 
probabilities of query answers can be computed as a product-like 
formula over the probabilities of the views appearing in the inter- 
section. For this sound approach, we also present an NP-hardness 
result for deciding whether a selection of c-independent views for a 
rewriting is possible. Then, going beyond rewritings that assume c- 
independent views, we present a sound algorithm, complete under 
fairly permissive restrictions, whose complexity drops to PTime 
under widely-applicable assumptions. More precisely, it runs in 
PTime modulo worst-case exponential time equivalence tests, with 
this upper-bound being strict whenever deterministic rewritings can 
be found in PTime. This is the best we can hope for as intersection 
makes query equivalence intractable already over deterministic data. 

All our results are practically interesting as they allow expressive 
queries and views, with descendant navigation and path filter pred- 
icates. For both semantics, the evaluation of an alternative plan is 
no more expensive than query evaluation over probabilistic XML. 

Outline of the paper. Preliminaries are given in Section 2. We 
formalize the view-based rewriting problem for probabilistic XML 
in Section 3. We then present our results for the two semantics of 
XML query results: in the absence of persistent node Ids, in Sec- 
tion 4, and in their presence, in Section 5. We discuss other related 
work in Section 6. We consider possible directions for future work 
and we conclude in Section 7. 

2. PRELIMINARIES 

We first describe the data and query model, which largely relies 
on the terminology and notation of [10, 9] and [2]. Minor exten- 
sions for probabilistic view-based query rewriting are given in Sec- 
tion 3.1, with the problem statement. For a more detailed presenta- 
tion of probabilistic XML we refer the reader to [2]. 

XML documents. We assume the existence of a set of labels C
that subsumes both XML tags and values. We consider an XML
document as an unranked, unordered rooted tree d modeled by a
set of edges edges(d), a set of nodes nodes(d), a distinguished
root node root(d) and a labeling function lbl, assigning to each
node a label from C. The label of root(d) is called the document
name of d. We assume that each node $n \in \mathrm{nodes}(d)$ has a unique
identifier (e.g., a numeric value) denoted Id(n).

EXAMPLE 1. Consider the document $d_{PER}$ in Figure 1 (where
PER stands for personnel), describing the personnel of an IT
department and the bonuses distributed for different projects. The
document $d_{PER}$ indicates that Rick worked under two projects
(laptop and pda) and got bonuses of 44 and 50 in the former
project and 25 in the latter one. Identifiers are written inside square
brackets and labels are next to them, e.g., the node $n_4$ is labeled
name, i.e., $\mathrm{lbl}(n_4) = \mathit{name}$.

[Figure 1: Example document $d_{PER}$]

[Figure 2: Example p-document $\widehat{P}_{PER}$]

Probabilistic documents. A finite probability space of XML
documents, or px-space for short, is a pair $(\mathcal{D}, \Pr)$ with $\mathcal{D}$ being
a set of documents and $\Pr$ mapping every $d \in \mathcal{D}$ to a probability
$\Pr(d)$ s.t. $\sum \{\Pr(d) \mid d \in \mathcal{D}\} = 1$.

p-Documents [2] give a general syntax for compactly represent- 
ing px-spaces. Like a document, a p-document is a tree but with 
two kinds of nodes: ordinary nodes, which have labels and are as in 
documents, and distributional, which are used to define the prob- 
abilistic process for generating random documents. We consider 
two kinds of distributional nodes: mux (for mutually exclusive) and 
ind (for independent). 

DEFINITION 1. A p-document $\widehat{P}$ is an unranked, unordered
tree with a set of edges edges($\widehat{P}$), nodes nodes($\widehat{P}$), the root
node root($\widehat{P}$), and a labeling function lbl, assigning to each node
n a label from $C \cup \{\mathrm{ind}(\Pr_n), \mathrm{mux}(\Pr_n)\}$. If lbl(n) is mux($\Pr_n$)
or ind($\Pr_n$), then $\Pr_n$ assigns to each child n' of n a probability
$\Pr_n(n')$, and if lbl(n) = mux($\Pr_n$), then also $\sum_{n'} \Pr_n(n') \le 1$.
We require leaves and the root to be C-labeled.

EXAMPLE 2. Fig. 2 shows a p-document $\widehat{P}_{PER}$ (PER stands
for personnel) having mux and ind distributional nodes, shown on
gray background. Node $n_{52}$ is a mux node with two children $n_{53}$
and $n_{56}$, where $\Pr_{n_{52}}(n_{53}) = 0.7$ and $\Pr_{n_{52}}(n_{56}) = 0.3$.

A p-document $\widehat{P}$ has as associated semantics a px-space $[\![\widehat{P}]\!]$
defined by runs of the following random process: independently for
each mux($\Pr_n$) (resp. ind($\Pr_n$)) node, select at most one (resp.
some) of its children n' and delete all other children along with
their descendants; then remove in turn each distributional node,
connecting ordinary children of deleted distributional nodes with
their closest ordinary ancestors. The result of such a run is a
random document $P$ (an ordinary document), whose nodes (Ids) are a
subset of those of $\widehat{P}$. Note that there might be several runs
resulting in the same $P$, e.g., by different choices under ordinary nodes
of $\widehat{P}$ that are not kept in $P$. The probability of a run is the product
of all (i) $\Pr_n(n')$ for each chosen child n' of a mux or ind node n,
(ii) $1 - \Pr_n(n')$ for each not chosen child n' of an ind node n,
(iii) $1 - \sum_{n'} \Pr_n(n')$, summing over all children n' of each mux
node n for which no child was chosen. The probability of a random
document $P$, $\Pr(P)$, is the sum of the probabilities of all runs
resulting in $P$.
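
The random process above can be sketched in a few lines of Python. This is only an illustrative sketch: the tuple encoding of nodes, the helper names (`ordn`, `mux`, `ind`, `runs`, `px_space`) and the small p-document at the end are assumptions made for the example, not part of the paper. It enumerates all runs, removes distributional nodes by attaching surviving ordinary children to their closest ordinary ancestor, and sums the probabilities of runs that yield the same document.

```python
from collections import defaultdict
from itertools import product
from math import prod

# Hypothetical encoding (not from the paper):
#   ordinary node:        ("ord", label, (child, ...))
#   distributional node:  ("mux" | "ind", None, ((prob, child), ...))

def ordn(label, *children):
    return ("ord", label, children)

def mux(*weighted):
    return ("mux", None, weighted)

def ind(*weighted):
    return ("ind", None, weighted)

def runs(node):
    """Return (probability, forest) pairs, one per run below `node`.
    A forest is the tuple of ordinary trees that `node` hands to its
    closest ordinary ancestor; a tree is (label, sorted tuple of subtrees)."""
    kind, label, children = node
    if kind == "ord":
        out = []
        for combo in product(*(runs(c) for c in children)):
            p = prod(pr for pr, _ in combo)
            kids = tuple(sorted(t for _, forest in combo for t in forest))
            out.append((p, ((label, kids),)))
        return out
    if kind == "mux":
        # choose exactly one child, or none with the leftover probability
        out = [(pr * p, f) for pr, c in children for p, f in runs(c)]
        rest = 1.0 - sum(pr for pr, _ in children)
        if rest > 1e-12:
            out.append((rest, ()))
        return out
    # ind: keep or drop each child independently
    options = []
    for pr, c in children:
        opts = [(pr * p, f) for p, f in runs(c)]
        if pr < 1.0:
            opts.append((1.0 - pr, ()))
        options.append(opts)
    return [(prod(p for p, _ in combo), tuple(t for _, f in combo for t in f))
            for combo in product(*options)]

def px_space(root):
    """Sum run probabilities per resulting document."""
    space = defaultdict(float)
    for p, (doc,) in runs(root):
        space[doc] += p
    return dict(space)

# Toy p-document (made up for illustration): a mux choice between two names
# and an independent optional "pda" child.
pdoc = ordn("a",
            mux((0.75, ordn("Rick")), (0.25, ordn("John"))),
            ind((0.9, ordn("pda"))))
space = px_space(pdoc)
```

The enumeration is exponential in the number of distributional nodes, so it is useful only for checking small examples by hand; it is not the PTime evaluation referred to later in the text.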

EXAMPLE 3. One can obtain the document $d_{PER}$ from $\widehat{P}_{PER}$
by choosing: the left child of the mux node $n_{11}$, the right child of the
mux node $n_{21}$, the left child of the mux node $n_{52}$, and either child
of the ind node $n_{53}$. The marginal probability of these choices (and
the probability of $d_{PER}$) is $0.4725 = 0.75 \times 0.9 \times 0.7 \times 1 \times 1$.

[Figure 3: TP queries $q_{RBON}$, $q_{BON}$ and TP views $v^1_{BON}$, $v^2_{BON}$]

By $d^n$ we denote the subdocument of d rooted at n and by $\widehat{P}^n$
the p-subdocument of $\widehat{P}$ rooted at a node n. Note that other kinds of
local distributional nodes, det (i.e., deterministic) and exp (i.e.,
explicit), are studied in [2], and all the results of this paper remain valid
for p-documents with all four of these distributional nodes. We use
only mux and ind nodes here because they are convenient and the
model based on them is a complete representation system [2].
Tree-pattern queries. The language of tree-pattern queries (TP) 
is roughly the subset of navigational XPath with child, descendant 
navigation, predicates, and without wildcards. 

DEFINITION 2. A tree-pattern q is a non-empty, unordered, un- 
ranked rooted tree, with a set of nodes nodes(q) labeled with sym- 
bols from C, a distinguished node called the output node out(q) 
(i.e., tree-patterns are unary queries), and two types of edges: child 
edges, labeled by / and descendant edges, labeled by //. The root 
of q is denoted root(q). 

The main branch mb(q) of q is the path from root(q) to out(q).
The depth of a main branch node is the distance from it to the root,
i.e., the depth of root(q) is 1 and that of out(q) is |mb(q)|.

For ease of exposition, we often write tree-patterns q in XPath
notation [7], and we refer to this notation by xpath(q). We use
lbl(q) as short notation for lbl(out(q)) and the following graphical
representation for tree-patterns: we use single lines to denote
child edges and double lines for descendant edges, the main branch
is the vertical path starting from the root, the output node is in a
circle, and predicates are subtrees starting with side branches (see
Figure 3). We say that a TP query is formulated over a document d
or over a p-document $\widehat{P}$ if lbl(root(q)) = lbl(root(d)),
respectively lbl(root(q)) = lbl(root($\widehat{P}$)).

EXAMPLE 4. Consider the queries $q_{RBON}$ and $q_{BON}$ in Figure 3,
left (where BON stands for bonuses and RBON for Rick's
bonuses). $q_{RBON}$ asks for bonuses of Rick received for the project
laptop and $q_{BON}$ asks for bonuses on laptop. The other two
queries $v^1_{BON}$ and $v^2_{BON}$ in Figure 3, right, ask for Rick's bonuses
and just for bonuses, respectively. The output nodes of all these
queries are labeled with bonus.

The semantics of tree-patterns can be given using embeddings.
An embedding e of a TP query q into a document d is a function
from nodes(q) to nodes(d) satisfying: (i) e(root(q)) = root(d);
(ii) for any $n \in \mathrm{nodes}(q)$, lbl(e(n)) = lbl(n); (iii) for
any /-edge $(n_1, n_2)$ in q, $(e(n_1), e(n_2))$ is an edge in d; (iv) for
any //-edge $(n_1, n_2)$ in q, there is a path from $e(n_1)$ to $e(n_2)$ in d.

The result of applying a tree-pattern q to a document d is the set:

$q(d) := \{e(\mathrm{out}(q)) \mid e \text{ is an embedding of } q \text{ into } d\}$.
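
The embedding semantics can be illustrated with a short Python sketch. The dictionary encoding of documents, the nested-tuple encoding of patterns and the function names are assumptions made for this example; the small document is only loosely modeled on Figure 1 (leaf values under the project nodes are omitted). The function `match` returns the possible images of the output node for embeddings of a subpattern rooted at a given document node, or `None` when no embedding exists.

```python
# Hypothetical encodings (not from the paper):
#   document: {node_id: (label, [child_ids])}
#   pattern:  (label, is_output, [(axis, subpattern), ...]), axis "/" or "//"

def descendants(doc, n):
    """Yield all proper descendants of node n."""
    stack = list(doc[n][1])
    while stack:
        m = stack.pop()
        yield m
        stack.extend(doc[m][1])

def match(pattern, n, doc):
    """Images of the output node over all embeddings of `pattern` that map
    its root to n; None if there is no such embedding."""
    label, is_output, edges = pattern
    if doc[n][0] != label:
        return None
    images = {n} if is_output else set()
    for axis, sub in edges:
        targets = doc[n][1] if axis == "/" else descendants(doc, n)
        ok = False
        for m in targets:
            r = match(sub, m, doc)
            if r is not None:
                ok = True
                images |= r
        if not ok:          # this edge cannot be satisfied below n
            return None
    return images

def evaluate(pattern, doc, root):
    """q(d): the set of images of out(q) over all embeddings of q into d."""
    return match(pattern, root, doc) or set()

# Simplified personnel document (illustrative only).
doc = {
    1: ("IT-personnel", [2, 3]),
    2: ("person", [4, 5]), 3: ("person", [6, 7]),
    4: ("name", [8]), 5: ("bonus", [24, 31]),
    6: ("name", [41]), 7: ("bonus", [51]),
    8: ("Rick", []), 24: ("laptop", []), 31: ("pda", []),
    41: ("Mary", []), 51: ("pda", []),
}

# IT-personnel//person[name/Rick]/bonus[laptop]
q_rbon = ("IT-personnel", False, [("//", ("person", False, [
    ("/", ("name", False, [("/", ("Rick", False, []))])),
    ("/", ("bonus", True, [("/", ("laptop", False, []))])),
]))])

# IT-personnel//person/bonus
v2_bon = ("IT-personnel", False, [("//", ("person", False, [
    ("/", ("bonus", True, [])),
]))])
```

Because the branches below a pattern node are matched independently, taking the union of output images over all successful targets of each edge yields exactly the images reachable by some complete embedding.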

EXAMPLE 5. For the queries in Figure 3, $q_{RBON}(d_{PER}) =
q_{BON}(d_{PER}) = v^1_{BON}(d_{PER}) = \{n_5\}$ and $v^2_{BON}(d_{PER}) = \{n_5, n_7\}$.

Intersections of tree-patterns. We consider in this paper the
extension $TP^{\cap}$ of TP, denoting intersections of tree-pattern queries:

$TP^{\cap} = \{q_1 \cap \dots \cap q_k \mid k \in \mathbb{N},\ q_i \in TP\}$.

We say that a $TP^{\cap}$ query $q = \bigcap_{i=1}^{k} q_i$ is formulated over a set of
documents D if $\bigcup_{i=1}^{k} \mathrm{lbl}(\mathrm{root}(q_i)) = \{\mathrm{lbl}(\mathrm{root}(d)) \mid d \in D\}$.
Its result over D is the node set
$\bigcap_{i=1}^{k} q_i(d \mid d \in D,\ \mathrm{lbl}(\mathrm{root}(q_i)) = \mathrm{lbl}(\mathrm{root}(d)))$.
Note that unsatisfiable $TP^{\cap}$ patterns q are possible
(when there is no set of documents D s.t. $q(D) \neq \emptyset$). For the
purposes of our paper, we assume hereafter only satisfiable $TP^{\cap}$-patterns;
satisfiability can be tested in a straightforward manner (we refer the
reader to this paper's extended version [11] for more details).

Query equivalence and containment. A pattern $q_1$ is contained
in a pattern $q_2$, denoted $q_1 \sqsubseteq q_2$, if $q_1(d) \subseteq q_2(d)$ for
every d. Also, $q_1$ is equivalent to $q_2$, denoted $q_1 \equiv q_2$, if
$q_1 \sqsubseteq q_2$ and $q_2 \sqsubseteq q_1$. We discuss how to check containment
of $TP^{\cap}$ queries in Section 5. For TP queries, containment can be
decided using containment mappings [4, 27], which are similar to embeddings.
In short, a containment mapping from $q_1$ to $q_2$ is a function from
nodes($q_1$) to nodes($q_2$) that respects the labels of nodes and
maps any two nodes connected with /-edges to nodes connected
with /-edges, while nodes connected with //-edges can be mapped
to any connected nodes. Then for $q_1$ and $q_2$ in TP, $q_2 \sqsubseteq q_1$ iff
there is a containment mapping from $q_1$ to $q_2$ [27]. Note that such
a mapping can be computed in polynomial time. For example, observe
that $q_{RBON}$ is contained in $v^2_{BON}$ and in $q_{BON}$, $v^1_{BON}$, while
neither of the latter two queries is contained in the other.

Unless stated otherwise, in this paper all the TP-queries are as- 
sumed to be minimized, i.e. without subsumed subqueries that have 
the same root (minimization can be done in PTime); equivalence 
of minimized queries amounts to isomorphism [27]. 

Querying p-documents. So far, queries were functions over
XML documents, outputting sets of nodes. Over p-documents $\widehat{P}$, a
query q (TP or $TP^{\cap}$) naturally yields a set of node-probability pairs
(n, p), for n a node of $\widehat{P}$ and p the probability that q can be
embedded into a random document $P$ of $\widehat{P}$ by some e s.t. e(out(q)) = n;
this value (p) will also be written as $\Pr(n \in q(P))$. Formally:

$q(\widehat{P}) := \{(n, p) \mid p = \sum_{P \in [\![\widehat{P}]\!],\ n \in q(P)} \Pr(P)\}$.

It is known [22] that TP queries can be evaluated over p-documents
$\widehat{P}$ in PTime in $|\widehat{P}|$ (data complexity); the same holds for $TP^{\cap}$.

EXAMPLE 6. $q_{BON}$ returns the node $n_5$ iff the right child of
the node $n_{21}$ is chosen, thus $q_{BON}(\widehat{P}_{PER}) = \{(n_5, 0.9)\}$. $v^1_{BON}$
returns $n_5$ iff the left child of $n_{11}$ is chosen, thus $v^1_{BON}(\widehat{P}_{PER}) =
\{(n_5, 0.75)\}$. $q_{RBON}$ returns $n_5$ iff both of the above conditions are
satisfied, thus $q_{RBON}(\widehat{P}_{PER}) = \{(n_5, 0.9 \times 0.75)\}$. Since $v^2_{BON}$
has no predicates, $n_5$ and $n_7$ are labeled with bonus and are not
probabilistically conditioned: $v^2_{BON}(\widehat{P}_{PER}) = \{(n_5, 1), (n_7, 1)\}$.
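
As a worked check of the third result in Example 6: the two conditions concern the distinct mux nodes $n_{11}$ and $n_{21}$, and the semantics of Section 2 makes the choices at distinct distributional nodes independently, so the probabilities multiply. The labels "left child" and "right child" below simply restate the choices named in the example.

```latex
\Pr\bigl(n_5 \in q_{RBON}(P)\bigr)
  = \Pr_{n_{11}}(\text{left child}) \times \Pr_{n_{21}}(\text{right child})
  = 0.75 \times 0.9
  = 0.675
```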

3. VIEW-BASED REWRITING 

We assume a set of view names V disjoint from the set of labels
C. By a view v we denote a tree-pattern query (that defines the
view) together with its name $v \in V$.

Deterministic view-based rewriting. Let d be a document,
v a view. A (deterministic) view extension of v over d, denoted $d_v$,
is an XML document obtained by connecting to a root node labeled
by a special label doc(v) all the documents from the set

$\{d' \mid d' \text{ subtree of } d \text{ s.t. } \mathrm{root}(d') \in v(d)\}$.

Hence $d_v$ can be queried by queries of the form doc(v)/lbl(v)/... .
If V is a set of views defined over d, then $D_V = \{d_v \mid v \in V\}$.

For Q either TP or $TP^{\cap}$ and $q \in Q$ that may use doc(v)/lbl(v),
for $v \in V$, the unfolding of q with V, denoted $\mathrm{unfold}_V(q)$, is a
Q-query obtained from q by replacing each occurrence of
doc(v)/lbl(v) with the definition of v.

[Figure 4: View extensions $(d_{PER})_{v^1_{BON}}$ and $(\widehat{P}_{PER})_{v^1_{BON}}$]

EXAMPLE 7. $v^1_{BON}$ and $v^2_{BON}$ in Figure 3 are views. The view
extension $(d_{PER})_{v^1_{BON}}$ is in Figure 4, left. Also, $(d_{PER})_{v^2_{BON}}$ has
a root labeled doc($v^2_{BON}$) under which there are the subtrees of
$d_{PER}$ rooted at $n_5$ and $n_7$.

In the deterministic setting, the problem of query answering using
views is to find an alternative query plan $q_r$, called a rewriting, that
can be used to answer q. Formally:

DEFINITION 3. Let d be a document, q a TP-query and V a set
of TP-views over d, $Q \in \{TP, TP^{\cap}\}$. A deterministic Q-rewriting
of q using V is a query $q_r \in Q$ over $D_V$ s.t. $\mathrm{unfold}_V(q_r) \equiv q$, i.e.,
for any instance of d, $\mathrm{unfold}_V(q_r)(d) = q(d)$.
The two alternatives, TP-rewritings and $TP^{\cap}$-rewritings, are
respectively motivated by the two possible interpretations of XML
query results. In an XML document, nodes have unique Ids used
by internal operators (selections, unions, joins, etc.) to manipulate
data during query evaluation. Queries can then either (i) introduce
fresh Ids for the nodes in the result (one for each node Id of the
original document), or (ii) expose (preserve) in the result the original
Ids from the document. The former case corresponds to what is
called the copy semantics, under which the Ids of any document in
$D_V$ are disjoint from those of d and from those of any other
document in $D_V$. Since in this case one cannot know whether nodes
in different view extensions are in fact copies of the same node in
d, the only possible rewritings are those that access a single
document from $D_V$ and maybe navigate inside it. In the latter case,
every document in $D_V$ preserves the original Ids, which will identify
nodes across different documents in $D_V$. One can thus formulate
and exploit more complex rewritings, as node Ids can be used to
intersect (join by Id) results of different views over the same input
data d. $TP^{\cap}$-rewritings $q_r$ extend TP-rewritings in that they can
access several $D_V$ documents at once, by first navigating in
individual documents and then intersecting the results.

View compensation. As in [10, 9], for TP queries $q_1$ and $q_2$,
the result of compensating $q_1$ with $q_2$, denoted comp($q_1$, $q_2$), is
a TP-query obtained by deleting the first symbol from xpath($q_2$)
and concatenating the rest to xpath($q_1$). $q_2$ is said to be the
compensation of $q_1$. For example, the result of compensating $q_1$ =
a/b with $q_2$ = b[c][d]/e is the concatenation of a/b and [c][d]/e:
comp($q_1$, $q_2$) = a/b[c][d]/e.
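
On XPath strings, compensation amounts to a simple string operation, as the following Python sketch suggests. The function names are hypothetical, and the sketch assumes (without checking) that the first label of xpath($q_2$) equals the output label of $q_1$, as in the example above; labels are assumed to contain neither `[` nor `/`.

```python
def first_label_end(xpath):
    """Index just past the first label (root symbol) of an XPath string."""
    i = 0
    while i < len(xpath) and xpath[i] not in "[/":
        i += 1
    return i

def comp(xpath1, xpath2):
    """Compensate xpath1 with xpath2: drop the first symbol of xpath2
    and append what remains to xpath1 (no label check is performed)."""
    return xpath1 + xpath2[first_label_end(xpath2):]

example = comp("a/b", "b[c][d]/e")   # the example from the text
```

The same helper applies to rewritings over a view extension, e.g. compensating `doc(v)/bonus` with `bonus[laptop]`.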

Intuitively, compensation brings further navigation over a view's
extension and, by results revisited in Section 4 ([36, 3]), a deterministic
TP-rewriting will be of the form $q_r$ = comp(doc(v)/lbl(v), ...).

3.1 Problem Definition 

Encoding probabilistic view extensions. Let $\widehat{P}$ be a p-document,
v a view. We generalize extensions to the probabilistic case
by simply bundling v's results (nothing more, nothing less) in one
p-document $\widehat{P}_v$, which is rooted at a node having a special label
doc(v), whose subtree is constructed as follows: (i) plug a unique
ind-child below root($\widehat{P}_v$), (ii) for each pair $(\alpha, \beta)$ in the set

$\{(\widehat{P}', p) \mid \widehat{P}' \text{ subtree of } \widehat{P},\ (\mathrm{root}(\widehat{P}'), p) \in v(\widehat{P})\}$

add $\alpha$ as a subtree of this ind-node with the probability $\beta$.

The role of $\widehat{P}_v$ is to give direct access to all the results of the view
v, by simply evaluating doc(v)/lbl(v) over this p-document; this
does not mean that we assume or later exploit an independence
property between view outputs (as the ind-node may suggest).

A set of p-documents $D_V$ for the set of views V and the unfolding
of a query over $D_V$ are defined as in the deterministic case.

Note that, for ease of exposition, we make here a slight abuse of
terminology: under both result semantics w.r.t. node Ids, within a
view extension, Ids are not necessarily unique; the same Id - either
preserved from the original document or a copy of one - may
appear several times in the extension. While this is unnecessary in
the deterministic context (and could be easily avoided, modulo
isomorphic results), it is necessary in the probabilistic one: to compute
$\Pr(n \in q(P))$ for some node n, n needs to be properly identified in
all its occurrences in a view's output, even for TP-rewritings (that
use only one view and do not intersect results based on Ids). W.l.o.g.,
to simplify the presentation of one of our proofs, we make these
multiple occurrences directly accessible through queries, by the
following post-processing step over view extensions: we plug below
each node n a child node with a fresh label "Id(n)". Also, w.l.o.g.,
even under copy semantics, an extension $\widehat{P}_v$ will be composed of
subtrees of the original document instead of copies thereof.

EXAMPLE 8. Continuing with Example 7, the view extension
$(\widehat{P}_{PER})_{v^1_{BON}}$ is in Figure 4, right. Each node n has a new child,
labeled "Id(n)" (whose own Id is omitted to avoid clutter). Also,
$(\widehat{P}_{PER})_{v^2_{BON}}$ has a root labeled doc($v^2_{BON}$), with an ind-child
under which there are the subtrees of $\widehat{P}_{PER}$ rooted at $n_5$ and $n_7$; the
edges between this ind-node and its children are valued 1.

Probabilistic view-based rewriting. Query answering using
views in the probabilistic setting is more involved than in the
deterministic one, as $q(\widehat{P})$ is a set of node-probability pairs. Therefore,
one should deal with two sub-problems: (i) find a query in terms
of views that retrieves the nodes N of $q(\widehat{P})$ (this corresponds to
deterministic rewritings) and (ii) compute the probabilities for the
nodes in N, using probabilities from $D_V$. Both sub-problems
require algorithms that access the p-documents in $D_V$ only. Formally:

DEFINITION 4. Let $\widehat{P}$ be a p-document, q a TP query and V
a set of TP views over $\widehat{P}$, and let $Q \in \{TP, TP^{\cap}\}$. A probabilistic
Q-rewriting $Q_r = (q_r, f_r)$ of q using V is a pair of

(i) a deterministic Q-rewriting $q_r$ - over random documents $P$ -
of q using V, and

(ii) a probability function $f_r$ s.t. for every node n of $\widehat{P}$ it holds
that $f_r(n, D_V) = \Pr(n \in q(P))$.

When $D_V$ is clear we will write $f_r(n)$ instead of $f_r(n, D_V)$.

Hence the additional challenge that needs to be addressed in
probabilistic view-based rewriting is to construct a probability
function $f_r$ that, by definition, has access only to the p-documents in
$D_V$. In Sections 4 and 5 we respectively discuss when and how
this is possible for TP and $TP^{\cap}$-rewritings.

4. TP-REWRITINGS 

When persistent node Ids cannot be exploited, only one view
extension $\widehat{P}_v$ could be used in a rewriting, by means of navigation [36,
3]. So a deterministic TP-rewriting $q_r$ could only be of the form

$q_r$ = doc(v)/lbl(v)[$p_1$][$p_2$]..[$p_g$], $q_r$ = doc(v)/lbl(v)[$p_1$][$p_2$]..[$p_g$]/p,
or $q_r$ = doc(v)/lbl(v)[$p_1$][$p_2$]..[$p_g$]//p,

for $v \in V$ and the TP-queries p and $p_i$ (possibly empty)
compensating v (with additional predicate conditions and navigation).

Main ingredients. Let us fix the names for the main ingredients
of this section: the input query q and the view v from the set of
views V - to be used in a rewriting - formulated over a p-document
$\widehat{P}$, and the integer k = |mb(v)|. Let n be one node for which we need
to compute the probability $\Pr(n \in q(P))$ via the $f_r$ function, and let
$n_1, \dots, n_a$ be the ancestor-or-self nodes of n that are selected by v
(i.e., for which $\Pr(n_i \in v(P)) > 0$).¹

Notation for "splitting" queries. We revisit the terminology
of [9] for "splitting" TP queries into a prefix, a suffix, or several
tokens. A prefix q' of q is any tree-pattern that can be obtained from
q by "moving up" the output mark, i.e., by setting as out(q') a
node of mb(q) and interpreting what follows that node as predicate
(side) branches. For any depth y, $q^{(y)}$ is the prefix of q with y main
branch nodes. A suffix q' of q is any subtree of q rooted at a node
of mb(q). The suffix of q rooted at the node of depth y is denoted
$q_{(y)}$. The main branch of q can be partitioned by its sub-sequences
separated by //-edges, and each pattern corresponding to such a
sub-sequence is called a token of q. We can thus see a tree-pattern
q as a sequence of tokens $q = t_1 // \dots // t_x$. The token $t_x$, which
ends with out(q), is the last token of q.
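
For patterns written in XPath notation, the token decomposition can be sketched as a split on the //-edges of the main branch only, leaving any // that occurs inside a predicate untouched. The function below is an illustrative assumption (its name and the string-based approach are not from the paper); it tracks bracket depth so that only top-level separators count.

```python
def tokens(xpath):
    """Split an XPath string into its tokens: the maximal pieces of the
    main branch separated by '//' outside of any [...] predicate."""
    parts, depth, start, i = [], 0, 0, 0
    while i < len(xpath):
        ch = xpath[i]
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif depth == 0 and xpath.startswith("//", i):
            parts.append(xpath[start:i])
            i += 2            # skip both slashes of the descendant edge
            start = i
            continue
        i += 1
    parts.append(xpath[start:])
    return parts

running = tokens("IT-personnel//person[name/Rick]/bonus[laptop]")
```

A pattern such as a[.//c]/b has a single token, since its only // lies inside a predicate.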

EXAMPLE 9. The prefix $q_{RBON}^{(2)}$ corresponds to the XPath
IT-personnel//person[name/Rick][bonus/laptop], the suffix
$(q_{RBON})_{(2)}$ rooted at the depth-2 node corresponds to
person[name/Rick]/bonus[laptop]. Also, $q_{RBON}$ can be split into tokens
$t_1$ = IT-personnel and $t_2$ = person[name/Rick]/bonus[laptop].
The following three queries that can be obtained from q or v will
be often referred to in this section:

• q': the query that can be obtained from the prefix $q^{(k)}$ by
removing all predicates of its output node, out($q^{(k)}$).

• v': the query that can be obtained from v by removing all
predicates of its output node, out(v).

• q'': the query obtained from $q^{(k)}$ by removing all predicates
of nodes other than the output node out($q^{(k)}$). Formally,

$q'' = \mathrm{comp}(\mathrm{mb}(q^{(k)}), (q^{(k)})_{(k)})$.

EXAMPLE 10. In our running example, for $q_{RBON}$ as the input
query q and $v^1_{BON}$ as the view v (k = 3), $q^{(k)}$ is q itself, q' corresponds
to the XPath IT-personnel//person[name/Rick]/bonus, q''
corresponds to IT-personnel//person/bonus[laptop], and v'
is $v^1_{BON}$ itself since there are no predicates on out(v).
We start by revisiting a key result from [36, 3], for deterministic
rewritings based on compensation:

FACT 1. [36, 3] Let q and V be TP-queries. There exists a
deterministic TP-rewriting of q over V iff there exists $v \in V$, with
k = |mb(v)|, such that comp(v, $q_{(k)}$) $\equiv$ q.

In our example, we have comp($v^1_{BON}$, bonus[laptop]) $\equiv q_{RBON}$.
Fact 1 can be verified in polynomial time [36]. As a reformulation
of it, we have that comp(v, $q_{(k)}$) $\equiv$ q iff the following hold:

$q^{(k)} \sqsubseteq v\ (\sqsubseteq v')$ and $(v \sqsubseteq)\ v' \sqsubseteq q'$.

Fact 1 says that, using one view v from V, we can find all the nodes
$n \in q(d)$ by querying $d_v$ with $q_{(k)}$, i.e., the data in $d_v$ suffices to
extract all such n. This naturally extends to a probabilistic setting:

¹Note that only the absence of //-edges in the main branch of the
view or of the compensation guarantees that n's ancestor-or-self
selected by v is unique, regardless of $\widehat{P}$ (see Definition 5).
Figure 5: p-Documents witnessing non-existence of TP-rewritings
(Examples 11 and 12)

we can find all the nodes n from $q(\widehat{P})$ by querying $\widehat{P}_v$ with $q_{(k)}$,
i.e., the data in $\widehat{P}_v$ suffices to extract all such n. Note that n is in the
query result $q(\widehat{P})$ iff $\Pr(n \in q(P)) > 0$.

PROPOSITION 1. Let q and v be TP-queries, k = |mb(v)|. Let
$q_r$ = comp(doc(v)/lbl(v), $q_{(k)}$) be a deterministic TP-rewriting of
q using v. Then for every p-document $\widehat{P}$ the following holds:

$\Pr(n \in q(P)) > 0$ if and only if $\Pr(n \in q_r(P_v)) > 0$.
The rest of this section is organized as follows. We first discuss the
existence of probabilistic TP-rewritings, comparing with the
deterministic case and illustrating by examples the aspects on which
the construction of the probability-retrieving function $f_r$ may
depend. These will allow us to articulate the frontier between the
feasible cases and the unfeasible ones. We then give some general
considerations on which our results for TP-rewritings are built
(Section 4.2). Then, before describing the general solution, we discuss
in Section 4.3 a particular case - the one of restricted rewritings -
that allows (i) a concise and intuitive formulation of the $f_r$
component of rewritings, based mainly on probabilistic c-independence,
and (ii) an efficient evaluation over the view extension, with no (or
little) post-processing. We consider in Section 4.4 the general case,
giving one additional necessary condition that, along with the ones
of Section 4.1 (Proposition 3 therein), will enable a sound and
complete procedure for the existence of probabilistic TP-rewritings.

4.1 Existence of Probabilistic TP-Rewritings

In the probabilistic setting, we first raise the question: is information in $\mathcal{P}_v$ always sufficient to extract the probabilities $\Pr(n \in q(\mathcal{P}))$ for nodes n in $q(\mathcal{P})$? We show that the answer to this question is negative, and that there are q and v for which a deterministic TP-rewriting $q_r$ exists but not a probabilistic one (i.e., no function $f_r$ exists such that for any $\mathcal{P}$ and node $n \in \mathcal{P}$ it holds that $f_r(n) = \Pr(n \in q(\mathcal{P}))$). Thus, the probabilistic rewriting problem is crucially different from the deterministic one. We present two examples (11 and 12) that give insight into this phenomenon.

EXAMPLE 11. Consider the query q = a/b[c] and the view v = a[.//c]/b. We show that there is no probabilistic rewriting ($q_r$, $f_r$) for q over {v}. One can see that comp(v, $q_{(k)}$) = a[.//c]/b[c] is equivalent to q, so $q_r$ = comp(doc(v)/lbl(v), $q_{(k)}$) is a deterministic TP-rewriting of q using v.

Consider now the two p-documents $\mathcal{P}_1$ and $\mathcal{P}_2$ from Figure 5. Clearly, $\Pr(b \in q(\mathcal{P}_1)) = 0.65 \times 0.5$ and $\Pr(b \in q(\mathcal{P}_2)) = 0.5$, and these probabilities are different. The function $f_r$ should compute the first probability, 0.325, on the p-document $(\mathcal{P}_1)_v$ and 0.5 on $(\mathcal{P}_2)_v$, hence $f_r$ should distinguish these p-documents. However, one can see that these p-documents are indistinguishable by v: $(\mathcal{P}_1)_v = (\mathcal{P}_2)_v$.$^{2}$ Hence, $f_r$ does not exist.

2 The probability of the b node is obtained directly as 0.65 in $(\mathcal{P}_1)_v$, and as $1 - (1 - 0.3) \times (1 - 0.5) = 0.65$ in $(\mathcal{P}_2)_v$.
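
The arithmetic behind Example 11 can be checked with a few lines of Python. This is only a sanity check of the numbers quoted in the example and its footnote; the variable names are ours, and Figure 5 itself is not reproduced here.

```python
# Probability of the b node in the view extension of each p-document,
# using the values quoted in Example 11 and its footnote.
p_b_view_doc1 = 0.65                        # read directly from (P_1)_v
p_b_view_doc2 = 1 - (1 - 0.3) * (1 - 0.5)   # combination given in the footnote

# Probability that b belongs to q's result over each original p-document.
p_b_query_doc1 = 0.65 * 0.5
p_b_query_doc2 = 0.5

# The view extensions agree, but the query answers do not.
same_view_extension = abs(p_b_view_doc1 - p_b_view_doc2) < 1e-12
same_query_answer = abs(p_b_query_doc1 - p_b_query_doc2) < 1e-12
print(same_view_extension, same_query_answer)
```

Any function that only sees the view extension receives the same input in both cases, so it cannot return two different values.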
The problem exposed in Example 11 comes from the fact that, in the unfolding a[.//c]/b[c], the predicate [.//c] coming from the view (whose probability of matching comes already "packed" into $\mathcal{P}_v$ results, as a condition located above the compensation depth k) and the predicate [c] coming from the compensation (whose probability needs to be computed from $\mathcal{P}_v$, as a condition at depth k) can interact. More precisely, the existence of a match for one predicate depends on the (non-)existence of a match of the other.

C-Independent queries. We now introduce a notion of independence of queries, allowing us to capture necessary properties for the existence of probabilistic rewritings (it will also be used for TP$^{\cap}$-rewritings). Two TP-queries $q_1$ and $q_2$ are probabilistically condition-independent (in short, c-independent), denoted $q_1 \perp q_2$, if, for every $\mathcal{P}$ and $n \in \mathcal{P}$, $\Pr(n \in (q_1 \cap q_2)(\mathcal{P}))$ equals

$$[\Pr(n \in q_1(\mathcal{P})) \times \Pr(n \in q_2(\mathcal{P}))] \div \Pr(n \in \mathcal{P}).$$

For instance, the queries $q_{BON}$ and $v_{BON}$ are c-independent, i.e., $q_{BON} \perp v_{BON}$. The two TP-queries a[b] and a[c] are not c-independent, as we can easily construct $\mathcal{P}$ over which, for some $n \in \mathcal{P}$, we have $\Pr(n \in a[b](\mathcal{P})) > 0$ and $\Pr(n \in a[c](\mathcal{P})) > 0$,
yet the probability of the joint test $\Pr(n \in a[b][c](\mathcal{P}))$ equals zero.
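
A minimal sketch of such a construction, assuming a hypothetical p-document in which the root a has a single child that is either b or c, chosen by a mutually exclusive choice with probability 1/2 each. The possible-worlds encoding and the helper `pr` are illustrative assumptions, not part of the paper's formalism.

```python
from fractions import Fraction

# Hypothetical p-document: each possible world is
# (probability, set of child labels found under the root a).
worlds = [(Fraction(1, 2), {"b"}), (Fraction(1, 2), {"c"})]

def pr(required_children):
    """Probability that a has every label in required_children as a child."""
    return sum(p for p, kids in worlds if required_children <= kids)

pr_appear = pr(set())       # Pr(n in P): the root is present in every world
pr_ab = pr({"b"})           # Pr(n in a[b](P))
pr_ac = pr({"c"})           # Pr(n in a[c](P))
pr_joint = pr({"b", "c"})   # Pr(n in a[b][c](P))

# Value the joint probability would need to take under c-independence.
independent_value = pr_ab * pr_ac / pr_appear
print(pr_joint, independent_value)
```

Both individual probabilities are positive, but the joint probability differs from the product divided by the appearance probability, so the definition is violated on this document.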
Deciding probabilistic c-independence is tractable: 

PROPOSITION 2. c-Independence is decidable (PTime) in TP.

PROOF SKETCH. We can use a syntactic notion instead of the probabilistic one for c-independent queries. Intuitively, it precludes dependencies between probabilities of predicates from the two queries to match over p-documents. The two definitions for c-independence can be proven equivalent, and testing for the syntactic one can be done in PTime, in particular via TP$^{\cap}$-satisfiability tests. Due to space limitations, we refer the reader to the extended version [11] for the complete proof. □

Getting back to Example 11, intuitively, we need to ensure that predicates above depth k from the view a[.//c]/b do not interact with those at depth k from the query. We can capture this in general by testing whether $v' \perp q''$. Note that in the example we have that v' = a[.//c]/b and q'' = a/b[c], and consequently $v' \not\perp q''$. We can prove the following adaptation of Fact 1 to the probabilistic case, towards avoiding unwanted probabilistic interactions.

PROPOSITION 3. Let q be a TP-query and $\mathcal{V}$ a set of TP-views. Then there exists a probabilistic TP-rewriting of q over $\mathcal{V}$ only if there exists $v \in \mathcal{V}$ as in Fact 1 (i.e., for k = |mb(v)|, comp(v, $q_{(k)}$) $\equiv$ q) such that $v' \perp q''$.

PROOF SKETCH. Example 11 can be used as a generic contra- 
dicting construction, to show that when the c-independence condi- 
tion does not hold there can be no probabilistic rewriting (it illus- 
trates this situation in the simplest possible form). □ 
An immediate corollary of Proposition 3 is that fewer views may 
be used to rewrite a query than in the deterministic case: 

COROLLARY 1. [of Prop. 3] v must satisfy $v' \equiv q'$, i.e., v and q are isomorphic modulo predicates at and below the depth k.

We show next that the conditions of Proposition 3 are only nec- 
essary for the existence of a probabilistic rewriting in general. 

EXAMPLE 12. Consider the query q = a//b[e]/c/b/c//d and the view v = a//b[e]/c/b/c. Clearly, $q_{(5)}$ is a compensation for v and $q_r$ = comp(v, $q_{(5)}$) is a deterministic TP-rewriting for q and v. Also, the conditions of Proposition 3 are satisfied by q and v.

In Figure 5 (right) we present two p-documents $\mathcal{P}_3$ and $\mathcal{P}_4$ that show the non-existence of a probability function $f_r$ such that ($q_r$, $f_r$) is a probabilistic TP-rewriting for q and v.

In the two documents, let $n_d$ denote the node labeled d, and let $n_{c1}$ and $n_{c2}$ denote the second node, respectively third node, labeled c (the ones selected by v). Clearly, we have that

$$\Pr(n_d \in q(\mathcal{P}_3)) = [0.4 \times 0.3 + 0.6 \times 0.4 - 0.3 \times 0.4 \times 0.6] = 0.288,$$
$$\Pr(n_d \in q(\mathcal{P}_4)) = [0.3 \times 0.4 + 0.3 \times 0.8 - 0.3 \times 0.4 \times 0.8] = 0.264.$$

A function $f_r$ that would be part of a probabilistic rewriting should be able to compute the probability value for $n_d$ to be selected by q in both $(\mathcal{P}_3)_v$ and $(\mathcal{P}_4)_v$, hence $f_r$ should distinguish these p-documents. But one can see that these p-documents are indistinguishable by v (i.e., $(\mathcal{P}_3)_v = (\mathcal{P}_4)_v$), as in both documents $n_{c1}$ is selected by v with probability 0.12, while $n_{c2}$ is selected by v with probability 0.24. Hence, $f_r$ does not exist.
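
The numbers of Example 12 can be replayed as follows. This sketch only re-evaluates the products quoted above; how they arise from the documents of Figure 5 is taken from the example as given, and the tuple layout is our own.

```python
def union_of_two(p1, p2, p_joint):
    """Probability of (e1 or e2) given both marginals and the joint probability."""
    return p1 + p2 - p_joint

# (probability via n_c1, probability via n_c2, joint term), as factored in Example 12.
doc3 = (0.4 * 0.3, 0.6 * 0.4, 0.3 * 0.4 * 0.6)
doc4 = (0.3 * 0.4, 0.3 * 0.8, 0.3 * 0.4 * 0.8)

print(round(union_of_two(*doc3), 3), round(union_of_two(*doc4), 3))
```

The two marginals (0.12 and 0.24) coincide across the documents, while the joint terms (0.072 and 0.096) differ, which is why the view results alone cannot determine the answer.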

The reason for which it is not possible to retrieve the right probability values for q's answer is twofold: (i) there are multiple (two) c-ancestors of the d-node, whose probabilities would have to be taken into account for $\Pr(n_d \in q(\mathcal{P}))$, and (ii) the images of the b[e]/c/b/c part of v (i.e., its last token) are not necessarily disjoint. This is because the sequence of labels (b, c, b, c) has a prefix, (b, c), that is also a suffix thereof; we call it hereafter a prefix-suffix. As a consequence, in particular, the separate probability of the lower image of b[e]/c/b/c is not computable because the [e] predicate might match in a part of the document that is never visible in view results (we only have access to $n_{c1}$ and its descendants).

The above example illustrates the remaining aspects on which the existence of a probabilistic rewriting may depend. These have to do with the nodes $n_1, \dots, n_a \in v(\mathcal{P})$ one may have to inspect to compute $\Pr(n \in q(\mathcal{P}))$ (whether there is just one such ancestor-or-self node or there are several) and the last token of v (whether it has predicates, whether images of it in a document are always disjoint). These aspects will allow us to fully characterize the feasible cases, in Sections 4.3 and 4.4. We first give some general considerations on which our results for TP-rewritings are built.

4.2 General Results 

To start, we can always formulate $\Pr(n \in q(\mathcal{P}))$ as follows:

$$\Pr(n \in q(\mathcal{P})) = \Pr\Big(\bigvee_{i=1}^{a} \big[\, n_i \in v(\mathcal{P}) \wedge n \in q_{(k)}(\mathcal{P}^{n_i}) \,\big]\Big).$$

Recall that a is the number of n's ancestor-or-self nodes selected by v. So we can always formulate $f_r$ in terms of view extensions as:

$$f_r(n) = \Pr\Big(\bigvee_{i=1}^{a} \big[\, n_i \in v'(\mathcal{P}) \wedge n \in q_{(k)}(\mathcal{P}_v^{n_i}) \,\big]\Big).$$

Let $e_i$ denote the event $n_i \in v'(\mathcal{P}) \wedge n \in q_{(k)}(\mathcal{P}_v^{n_i})$, for $i = 1, \dots, a$. By the inclusion-exclusion principle, we can give the following general formulation of the $f_r$ function:

LEMMA 1. $f_r$ can always be formulated as follows:

$$f_r(n) := \Pr\Big(\bigvee_{i} e_i\Big) = \sum_{i} \Pr(e_i) - \sum_{i_1 < i_2} \Pr(e_{i_1} \cap e_{i_2}) + \dots + (-1)^{a-1} \Pr\Big(\bigcap_{i} e_i\Big). \qquad (1)$$
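
As an illustration of Eq. (1), the following sketch evaluates the inclusion-exclusion sum over an explicit, hypothetical set of possible worlds. The representation of events as sets of world identifiers is an assumption made for the example; the sum ranges over all non-empty subsets of the a events.

```python
from fractions import Fraction
from itertools import combinations

def union_probability(events, world_prob):
    """Pr(e_1 or ... or e_a) by inclusion-exclusion, in the shape of Eq. (1).

    events: list of sets of world ids in which each event holds.
    world_prob: dict mapping world id -> probability of that world.
    """
    def pr(world_ids):
        return sum(world_prob[w] for w in world_ids)

    total = 0
    for size in range(1, len(events) + 1):
        sign = (-1) ** (size - 1)
        for subset in combinations(events, size):
            # Probability that all events of the subset hold simultaneously.
            total += sign * pr(set.intersection(*subset))
    return total

# Hypothetical worlds and events, for illustration only.
world_prob = {0: Fraction(1, 10), 1: Fraction(2, 10),
              2: Fraction(3, 10), 3: Fraction(4, 10)}
events = [{0, 1}, {1, 2}, {1, 3}]
print(union_probability(events, world_prob))
```

The result coincides with the probability of the plain union of the three event sets, which is what the alternating sum is meant to compute.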

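Under the independence condition $v' \perp q''$, the following holds:

LEMMA 2. Under the conditions of Proposition 3 (i.e., for k = |mb(v)|, comp(v, $q_{(k)}$) $\equiv$ q and $v' \perp q''$) we can compute the probability of an event $e_i$ as follows:

$$\Pr(e_i) = \big[\Pr(n_i \in v(\mathcal{P})) \div \Pr(n_i \in v_{(k)}(\mathcal{P}_v^{n_i}))\big] \times \Pr(n \in q_{(k)}(\mathcal{P}_v^{n_i})). \qquad (2)$$

PROOF SKETCH. Immediate by the following reformulations:

$$\Pr(e_i) = \Pr(n_i \in v'(\mathcal{P})) \times \Pr(n \in q_{(k)}(\mathcal{P}_v^{n_i}) \mid n_i \in v'(\mathcal{P}))$$
$$\phantom{\Pr(e_i)} = \Pr(n_i \in v'(\mathcal{P})) \times \Pr(n \in q_{(k)}(\mathcal{P}_v^{n_i})),$$
$$\Pr(n_i \in v'(\mathcal{P})) = \big[\Pr(n_i \in v(\mathcal{P})) \div \Pr(n_i \in v_{(k)}(\mathcal{P}_v^{n_i}))\big]. \quad \Box$$
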

Note that Lemma 1 does not imply that $f_r$ is always computable (recall Example 12). In fact, we will show in Section 4.4 that the probability of joint $e_i$ events may not be computable from $\mathcal{P}_v$. Before discussing this, we consider a restricted case (where v selects a unique ancestor-or-self node of n, i.e., a = 1), for which there is no need to manage joint events.

4.3 Restricted TP-Rewritings 

DEFINITION 5. We say that a TP-rewriting using a view v and compensation c is restricted if either mb(v) has no //-edges, or mb(c) has no //-edges.

The case of restricted rewritings was brought forward by the following question: In the deterministic case, the XML answers q(d) can be obtained by executing the alternative plan over the view extension $d_v$ (a deterministic XML document); since, by construction, in the probabilistic case, the view extension $\mathcal{P}_v$ is a p-document, would it be enough to query it with the rewriting and get "for free" both the XML nodes n of $q(\mathcal{P})$ and their probabilities $\Pr(n \in q(\mathcal{P}))$? This would represent, in our view, the most intuitive formulation of the $f_r$ function, requiring no post-processing after the querying phase. We show that this is indeed possible for restricted rewritings under Proposition 3's conditions. More precisely, we show that this approach works modulo one minor adjustment: accounting for certain probability values (of out(v) predicates to match), which come already "packed" into results of v; these have to be divided away since they will be re-accounted for by compensation.

Let $n_a$ be the unique ancestor-or-self of n that is selected by v. The $f_r$ formulation from Eq. (1) can now be simplified, reflecting the following: the probability $\Pr(n \in q(\mathcal{P}))$ that n occurs in q's result amounts to the probability $\Pr(n \in q_r(\mathcal{P}_v))$ that n is selected by $q_r$ in $\mathcal{P}_v$, divided by the probability $\Pr(n_a \in v_{(k)}(\mathcal{P}_v^{n_a}))$ that $n_a$ verifies the predicates on out(v).

THEOREM 1. Let $q_r$ = comp(doc(v)/lbl(v), $q_{(k)}$) be a restricted deterministic TP-rewriting of q using v. Then, there exists a probabilistic TP-rewriting ($q_r$, $f_r$) of q using v if and only if $v' \perp q''$. Moreover, $f_r$ can be computed as follows:

$$\Pr(n \in q(\mathcal{P})) = \Pr(n \in q_r(\mathcal{P}_v)) \div \Pr(n_a \in v_{(k)}(\mathcal{P}_v^{n_a})).$$

PROOF SKETCH. We start by assuming that $v' \perp q''$, which was already proven a necessary condition. We discuss next how $f_r$ can be computed when this c-independence condition holds.

Let us first assume that there are no predicates on out(v), which implies that v = v'. Hence, for the assumed node $n_a$, we have that $\Pr(n_a \in v_{(k)}(\mathcal{P}_v^{n_a})) = 1$. Knowing this, let us now understand whether the equality we aim for can indeed hold, namely:

$$\Pr(n \in q(\mathcal{P})) = \Pr(n \in q_r(\mathcal{P}_v)).$$

By $\mathcal{P}_v$'s definition, we can write the right-hand side as follows:

$$\Pr(n \in q_r(\mathcal{P}_v)) = \Pr(n_a \in doc(v)/lbl(v)(\mathcal{P}_v)) \times \Pr(\text{of matching the rest of } q_r \text{ from } n_a \text{ in } \mathcal{P}_v)$$
$$= \Pr(n_a \in v(\mathcal{P})) \times \Pr(n \in q_{(k)}(\mathcal{P}_v^{n_a}))$$
$$= \Pr(n_a \in v(\mathcal{P})) \times \Pr(n \in q_{(k)}(\mathcal{P}^{n_a}))$$
$$= \Pr(n \in q(\mathcal{P})).$$

The first two reformulations are immediate by $\mathcal{P}_v$'s construction; the third one is enabled by the c-independence condition $v' \perp q''$.

To complete the proof, if there are predicates on out(v), by the same c-independence condition we have that

$$\Pr(n_a \in v'(\mathcal{P})) = \Pr(n_a \in v(\mathcal{P})) \div \Pr(n_a \in v_{(k)}(\mathcal{P}_v^{n_a})),$$

and then we can use v' in the previous line of reasoning. □



As an immediate corollary of Theorem 1, when v has no predicates on the output node, the $f_r$ function becomes the intuitive one:

$$\Pr(n \in q(\mathcal{P})) = \Pr(n \in q_r(\mathcal{P}_v)),$$

hence a simple interrogation of the view extension with the plan comp(doc(v)/lbl(v), $q_{(k)}$) would retrieve both the XML data and (for free) their probability.$^{3}$

EXAMPLE 13. A deterministic TP-rewriting for $q_{BON}$ is $q_r$ = comp(doc($v^{2}_{BON}$)/bonus, $q_{(3)}$), and this plan is obviously restricted. Observe that $(v^{2}_{BON})' \perp (q_{BON})''$ (and $(v^{2}_{BON})' \equiv (q_{BON})'$), so Theorem 1 applies. So the probability $\Pr(n_5 \in q_{BON}(\mathcal{P}_{PER}))$ is $\Pr(n_5 \in q_r((\mathcal{P}_{PER})_{v^{2}_{BON}})) \div \Pr(n_5 \in (v^{2}_{BON})_{(3)}((\mathcal{P}_{PER})_{v^{2}_{BON}}^{n_5})) = 0.9 \div 1$. Besides $n_5$, for all other nodes $n_i$, $\Pr(n_i \in q_{BON}(\mathcal{P}_{PER})) = 0$ since $\Pr(n_i \in q_r((\mathcal{P}_{PER})_{v^{2}_{BON}})) = 0$.
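
The formula of Theorem 1 amounts to a single division, sketched below. The function name and argument names are ours; the two inputs are assumed to have been obtained by querying the view extension, and the sample values are the ones read from Example 13.

```python
def restricted_fr(pr_qr_in_view_ext, pr_out_predicates):
    """Theorem 1 style f_r: divide away the probability of the predicates on out(v).

    pr_qr_in_view_ext: Pr(n in q_r(P_v)), obtained by running q_r on the view extension.
    pr_out_predicates: Pr(n_a in v_(k)(P_v^{n_a})), the probability that n_a
                       satisfies the predicates on out(v).
    """
    if pr_out_predicates <= 0:
        raise ValueError("n_a must satisfy out(v)'s predicates with positive probability")
    return pr_qr_in_view_ext / pr_out_predicates

# Example 13: no predicates on out(v), so the divisor is 1.
print(restricted_fr(0.9, 1.0))
```

When v has no predicates on its output node, the divisor is 1 and the function returns the probability found in the view extension unchanged.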

4.4 Unrestricted TP-Rewritings 

We now consider the general case, starting with an additional necessary condition that, along with the ones of Proposition 3, enables a sound and complete procedure for the existence of probabilistic rewritings. In order to go beyond the scope of the previous section (i.e., restricted rewritings), we must assume that (i) the view v has at least one //-edge in the main branch, and (ii) the compensation $q_{(k)}$ has at least one //-edge in the main branch.

We show that the remaining ingredient for deciding whether a probabilistic TP-rewriting exists is the last token of v. Let t be this token, of the form $t = l_1[Q_1]/\dots/l_m[Q_m]$, where $l_m$ = lbl(v) and any of $Q_1, \dots, Q_m$ may be empty. Also, let u denote the length of the maximal prefix-suffix of the sequence of labels $(l_1, \dots, l_m)$, so $0 \le 2 \times u \le m$. Hence, when $u \ge 1$, we can write t as follows:

$$l_1[Q_1]/l_2[Q_2]/\dots/l_u[Q_u]/\dots/l_1[Q_{m-u+1}]/\dots/l_{u-1}[Q_{m-1}]/l_u[Q_m]$$

EXAMPLE 14. Revisiting Example 12, the last token of our view is b[e]/c/b/c, for which the sequence of labels of the main branch, (b, c, b, c), has a maximal prefix-suffix of length 2. Hence in the example we have u = 2.
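
The length u of the maximal prefix-suffix can be computed directly from the label sequence. The sketch below assumes the bound $0 \le 2u \le m$ on u stated before Example 14, so that the prefix and the suffix do not overlap; the function name is ours.

```python
def max_prefix_suffix(labels):
    """Largest u with 0 <= 2*u <= len(labels) such that the first u labels
    coincide with the last u labels; 0 if there is no such non-empty prefix."""
    m = len(labels)
    for u in range(m // 2, 0, -1):
        if labels[:u] == labels[-u:]:
            return u
    return 0

# Last token of the view in Example 12: b[e]/c/b/c has labels (b, c, b, c).
print(max_prefix_suffix(["b", "c", "b", "c"]))
```

A simple scan from the largest candidate downwards is enough here, since tokens are short compared to the data.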

We can now give our main result for unrestricted rewritings. 

THEOREM 2. Let q and v be TP-queries s.t. there is a deterministic, non-restricted TP-rewriting $q_r$ = comp(doc(v)/lbl(v), $q_{(k)}$). Let u be the size of the maximal prefix-suffix of v's last token. There exists a probabilistic TP-rewriting ($q_r$, $f_r$) of q using v iff

1. $v' \perp q''$ (the condition of Proposition 3), and

2. the first u − 1 nodes of v's last token have no predicates.

Moreover, $f_r$ can be computed as in Equation (1).

PROOF SKETCH. Example 12 can be used as a generic contradicting construction, to show that when some of the first u − 1 nodes of v's last token have predicates there can be no probabilistic rewriting (observe that it illustrates this situation in the simplest possible form). We describe in the rest of the proof one possible way to build $f_r$ when the condition holds, via queries that exploit the special Id(n) nodes we introduced in view extensions.

The individual probability of events can be computed as in Eq. (2). We detail next the probability of joint $e_i$ events.

Case u = 0. When the label sequence $(l_1, \dots, l_m)$ has no prefix-suffix, any probability of the form $\Pr(e_i \cap e_j)$, for $n_i$ ancestor of $n_j$, can be computed as follows.

3 Note that, for any p-document $\mathcal{P}$ and any node n s.t. $\Pr(n \in q_r(\mathcal{P}_v)) > 0$, if there is only one ancestor-or-self $n_a$ of n s.t. $\Pr(n_a \in v(\mathcal{P})) > 0$, Theorem 1's approach is sound and complete, regardless of whether $q_r$ is restricted or not. However, in the case of unrestricted plans, this approach would be data-dependent.






INPUT : TP query q and views V
OUTPUT: Set of TP-rewritings R

R := ∅;
for each v ∈ V do
    k := |mb(v)|,
    t := last token of v,
    u := size of maximal prefix-suffix in t
    if comp(doc(v)/lbl(v), q_(k)) ≡ q then
        q' := the prefix of q of size k
        v' := v w/o predicates at node of depth k (out(v))
        q'' := comp(mb(q'), q_(k))
        if v' is not c-independent of q'' then continue;
        if comp(doc(v)/lbl(v), q_(k)) is restricted then
            R := R ∪ {comp(doc(v)/lbl(v), q_(k))}
        else if no predicates on the first u − 1 nodes in t then
            R := R ∪ {comp(doc(v)/lbl(v), q_(k))}



Assuming that $n_i \in v'(\mathcal{P})$ already, we construct a TP$^{\cap}$-pattern $\alpha$ that will test in $\mathcal{P}$ the remaining conditions for $e_i \cap e_j$:

$$\alpha = q_{(k)} \cap \mathrm{comp}(l_m // l_1[Q_1]/\dots/l_m[Q_m][Id(n_j)],\; q_{(k)}).$$

Knowing $n_i \in v'(\mathcal{P})$, we can then test by $n \in \alpha(\mathcal{P}^{n_i})$ all the remaining conditions for $e_i \wedge e_j$. More precisely, we test that:

• $n_j$ is also selected by v'; for this, only the conditions of the last token need to be tested since the rest matches already for $n_i$ to be selected by v',

• n is selected by $q_{(k)}$ from $n_i$ (left-hand side of intersection),

• n is also selected by $q_{(k)}$ from $n_j$ (compensation in right-hand side of intersection).

Therefore we now have the following:

$$\Pr(e_i \cap e_j) = \Pr(n_i \in v'(\mathcal{P})) \times \Pr(n \in \alpha(\mathcal{P}^{n_i}) \mid n_i \in v'(\mathcal{P}))$$
$$= \Pr(n_i \in v'(\mathcal{P})) \times \Pr(n \in \alpha(\mathcal{P}_v^{n_i}) \mid n_i \in v'(\mathcal{P}))$$
$$= \Pr(n_i \in v'(\mathcal{P})) \times \Pr(n \in \alpha(\mathcal{P}_v^{n_i}))$$
$$= \big[\Pr(n_i \in v(\mathcal{P})) \div \Pr(n_i \in l_m[Q_m](\mathcal{P}_v^{n_i}))\big] \times \Pr(n \in \alpha(\mathcal{P}_v^{n_i})).$$

We can take the steps in $\alpha$ because different images in $\mathcal{P}_v$ of the part $l_1/\dots/l_m$ of v are necessarily disjoint. The second reformulation is an immediate consequence of $v' \perp q''$. The third one follows also from it, since $v_{(k)} = l_m[Q_m]$. So $\Pr(e_i \cap e_j)$ can be computed using v's results, with the special representation of Id values allowing us here a simpler formulation for it.

Any conjunction of up to a events can be computed similarly.

Case u ≥ 1. Let $(l_1, \dots, l_u)$ be the maximal prefix-suffix of the sequence $(l_1, \dots, l_m)$, and assume that there are no predicates on the first u − 1 nodes of v's last token t, of the form:

$$t = l_1/l_2/\dots/l_{u-1}/l_u[Q_u]/\dots/l_1[Q_{m-u+1}]/\dots/l_{u-1}[Q_{m-1}]/l_u[Q_m]$$

We describe below the formula for $\Pr(e_i \cap e_j)$, in the case of two events $e_i$ and $e_j$. For $n_i$ ancestor of $n_j$, let s(i, j) denote the number of data nodes from $n_i$ to $n_j$ in $\mathcal{P}$, including these two nodes (we can always get the s(i, j) values from $\mathcal{P}_v^{n_i}$). The formula for the probability of joint events will change slightly (via the $\alpha$ pattern).

If s(i, j) > m, the $\alpha$ pattern and the probability formula remain the same as in the case of u = 0.

Otherwise, if s(i, j) ≤ m, let $\alpha$ be defined by the TP$^{\cap}$-pattern:

$$\alpha = q_{(k)} \cap \mathrm{comp}(l_{m-s(i,j)+1}[Q_{m-s(i,j)+1}]/\dots/l_m[Q_m][Id(n_j)],\; q_{(k)})$$

Then, we can formulate $\Pr(e_i \cap e_j)$ as before:

$$\Pr(e_i \cap e_j) = \Pr(n_i \in v'(\mathcal{P})) \times \Pr(n \in \alpha(\mathcal{P}^{n_i}) \mid n_i \in v'(\mathcal{P}))$$
$$= \Pr(n_i \in v'(\mathcal{P})) \times \Pr(n \in \alpha(\mathcal{P}_v^{n_i}) \mid n_i \in v'(\mathcal{P}))$$
$$= \Pr(n_i \in v'(\mathcal{P})) \times \Pr(n \in \alpha(\mathcal{P}_v^{n_i}))$$
$$= \Pr(n \in \alpha(\mathcal{P}_v^{n_i})) \times \big[\Pr(n_i \in v(\mathcal{P})) \div \Pr(n_i \in l_m[Q_m](\mathcal{P}_v^{n_i}))\big]$$

A similar approach can be used to compute any conjunction of up to a events under the assumption that there are no predicates on the first u − 1 nodes of v's last token. □

We summarize the results of this section with algorithm TPrewrite (Figure 6), which takes as input a TP query q and a set of views V and returns all possible deterministic TP-rewritings $q_r$ which can be complemented by an $f_r$ function, for a probabilistic TP-rewriting.

PROPOSITION 4. TPrewrite is sound and complete for deciding 
whether a probabilistic TP-rewriting of a query q over views V 
exists. It runs in PTime in the size of the query and views. 

Remark. Note that, in the case of probabilistic TP-rewritings, there is a complexity separation between the decision problem for the existence of a rewriting, which is tractable, and the one of executing the alternative access plans based on views, which can be done in EXPTime. Exponential time in the size of the query and views is unavoidable in practice, since TP-query evaluation over p-documents (and view extensions) is intractable in query size. We strongly conjecture that the same complexity bounds should remain valid for the evaluation of intersections of tree pattern queries, as in the deterministic case. Although our general formula from Eq. (1), for the probability-retrieving function, can be exponentially large in the size of the view result (by the inclusion-exclusion formulation), it can be reformulated into one that remains tractable in the size of the data, in a rather technical but not very complex manner. For space reasons, these details are omitted.

Figure 6: Algorithm TPrewrite for probabilistic TP-rewritings

5. TP$^{\cap}$-REWRITINGS

We consider in this section the problem of view-based rewriting over probabilistic data in the presence of persistent node Ids, using TP$^{\cap}$-rewritings, i.e., intersections of possibly compensated views. The pattern $q_r$ of a TP$^{\cap}$-rewriting will now be of the form $\bigcap_{i,j} u_{ij}$, where each $u_{ij}$ is a TP-rewriting over some view $v_j$, i.e., a possibly compensated view.

Let $\mathcal{V} = \{v_1, \dots, v_m\}$, with $m = |\mathcal{V}|$, be the set of TP views to be used in a rewriting (each $v_i$ contains q or a prefix thereof). Given a candidate plan $q_r = \bigcap_{i,j} u_{ij}$ in TP$^{\cap}$, verifying that it is a deterministic rewriting of q in TP can be done by verifying $\mathrm{unfold}_{\mathcal{V}}(q_r) \equiv q$ (see [10]), which in turn amounts to testing that

(i) each TP query in $\mathrm{unfold}_{\mathcal{V}}(u_{ij})$ contains q, and

(ii) q contains the TP$^{\cap}$ query $\mathrm{unfold}_{\mathcal{V}}(q_r)$.

Before discussing TP$^{\cap}$-rewritings, we recall how one can decide containment and equivalence between a TP query and a TP$^{\cap}$ one.

5.1 Equivalence and Containment for TP$^{\cap}$

Since Q in TP$^{\cap}$ is a rewriting for q in TP iff $\mathrm{unfold}_{\mathcal{V}}(Q) \equiv q$, deciding whether a TP query q is equivalent to a TP$^{\cap}$ query Q is a crucial step for our problem. It is known [10] that one can rely on mappings to decide whether $q \equiv Q$. For that, Q can be first equivalently reformulated into the union of TP queries $\bigcup_i Q_i$, called its possible interleavings, which can be exponentially large in |Q|. Interleavings capture all the possible ways to order or coalesce the main branch nodes of queries participating in the intersection.$^{4}$ Testing $q \equiv Q$ was shown to be coNP-hard and boils down to testing $q \equiv \bigcup_i Q_i$, which in turn boils down to testing that (i) for some j, $q \sqsubseteq Q_j$, and (ii) for all i, $Q_i \sqsubseteq q$. (This is reminiscent of results from relational databases, on comparing conjunctive queries with unions of conjunctive queries.) We can immediately conclude that the following also holds:

COROLLARY 2. Deciding the existence of a probabilistic TP$^{\cap}$-rewriting for a TP query q and TP views V is coNP-hard.

4 I.e., ways that are not leading to unsatisfiability.
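
The two-part test for equivalence between q and the union of interleavings can be phrased generically, given a containment oracle. In the sketch below, `contained_in(x, y)` stands for $x \sqsubseteq y$ and is supplied by the caller; the toy oracle used in the demonstration (queries as sets of conditions, with fewer conditions meaning a more general query) is purely an assumption for illustration and is not the mapping-based TP containment test of [10].

```python
def equivalent_to_union(q, interleavings, contained_in):
    """q is equivalent to the union of interleavings iff
    (i) q is contained in some Q_j, and (ii) every Q_i is contained in q."""
    return (any(contained_in(q, Qj) for Qj in interleavings)
            and all(contained_in(Qi, q) for Qi in interleavings))

# Toy stand-in for containment (assumption): x is contained in y
# when every condition of y is also a condition of x.
def toy_contained_in(x, y):
    return y <= x

q = frozenset({"A", "B"})
ok = equivalent_to_union(q, [frozenset({"A", "B"}), frozenset({"A", "B", "C"})],
                         toy_contained_in)
bad = equivalent_to_union(q, [frozenset({"A"}), frozenset({"A", "B"})],
                          toy_contained_in)
print(ok, bad)
```

The cost of the test is dominated by the number of interleavings, which, as noted above, can be exponential in the size of Q.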

The equivalence problem was however shown in [10] to be solv- 
able in PTIME when q belongs to a restricted fragment of TP, 
called extended skeletons. 

Extended skeletons. Informally, this fragment limits the use of 
//-edges in predicates, in the following manner: a token t of a TP 
query v will not have predicates that have //-edges and that may 
become redundant because of descendants of t and their respective 
predicates in some interleaving v might be involved in (by inter- 
secting v with some other query). To define extended skeletons 
we use the following additional terminology: by a //-subpredicate st we denote a predicate subtree whose root is connected by a //-edge to a linear /-path l that comes from the main branch node n to which st is associated (as in n[...[.//st]]). l is called the incoming /-path of st and can be empty. Extended skeletons are patterns v having the following property: for any main branch node
n and //-subpredicate st of n, there is no mapping (in either di- 
rection) between the incoming /-path of st and the /-path following 
n in the main branch (where the empty path is assumed to map in 
any other path). For example, the expressions a[b//c//d]/e//d or 
a[b//c]/d//e are extended skeletons, while a[b//c]/b//d, a[b//c]//d, 
a[.//b]/c//d or a[.//b]//c are not. 

This TP sub-fragment does not restrict in any way the use of //- 
edges in the main branch or the use of predicates with /-edges only. 

As our focus in this paper is on efficient algorithms for view- 
based rewriting, it is thus natural to ask if over probabilistic data 
this problem remains tractable, when we deal with extended skele- 
tons under persistent node Ids. Our general approach hereafter will 
be to describe decision and evaluation techniques that are sound and complete when applied to any queries and views, but may depend, unavoidably, on equivalence tests involving TP$^{\cap}$-queries. Therefore, their complexity will depend on the one of such tests.

5.2 Using Pairwise c-Independent Views 

As for TP-rewritings, we present our results starting with the assumption that a deterministic rewriting $q_r$ has been found. Without loss of generality, let us first assume that $q_r$ consists only of intersected views (plans with possibly compensated views are discussed in Section 5.4). W.l.o.g., $q_r$ can be of the form $q_r = doc(v_1)/v_1 \cap \dots \cap doc(v_m)/v_m$ and, necessarily, $q \sqsubseteq v_i$, for all $v_i$.

We first give some intuition on the possible construction of the probability component of the rewriting, $f_r$: for a given node $n \in q(\mathcal{P})$, since each view $v_i$ gives a probability value, $\Pr(n \in v_i(\mathcal{P}))$, and since we are interested in the probability of the intersection thereof, we might be tempted to try what is arguably the most intuitive definition for $f_r$ here, the one which would simply combine by multiplication the values $\Pr(n \in v_i(\mathcal{P}))$. There are however two issues with this straightforward $f_r$ candidate.

The first issue is probabilistic dependencies. We have introduced the notion of c-independence in the previous section, which can guarantee that the existence of some embedding of a view $v_i$ in a given document does not depend, w.r.t. predicate conditions, on the existence (or non-existence) of some embedding of another view $v_j$ in this document. We will see now that, for pairwise c-independent views, a function $f_r$ based on multiplication of these views' probabilities can be built.

The second issue has to do with an adjustment for a probability term that appears in each of the views' probability values. More precisely, for each node n that appears in $q(\mathcal{P})$ and, consequently, appears in each of $v_1(\mathcal{P}), \dots, v_m(\mathcal{P})$, we have m probability values $\Pr(n \in v_i(\mathcal{P}))$. Furthermore, each value $\Pr(n \in v_i(\mathcal{P}))$ can be seen as the product of two distinct probability terms:

(i) the (appearance) probability of n being part of a possible world of $\mathcal{P}$, denoted $\Pr(n \in \mathcal{P})$,

(ii) the probability of n being selected by $v_i$ in a possible world in which n is known to appear, denoted in the following $\Pr(n \in v_i(\mathcal{P}) \mid n \in \mathcal{P})$.

Note that the first term is independent of any particular view to whose result we may be referring, as it only depends on the document itself (this is reflected by our notation). We can thus write for each $v_i$ and node n that

$$\Pr(n \in v_i(\mathcal{P})) = \Pr(n \in \mathcal{P}) \times \Pr(n \in v_i(\mathcal{P}) \mid n \in \mathcal{P}).$$

Given a deterministic rewriting $q_r$ of q formed by pairwise c-independent views $v_1, \dots, v_m$, for a node $n \in q(\mathcal{P})$, we would thus have as the overall product the following formulation:

$$\prod_i \Pr(n \in v_i(\mathcal{P})) = \Pr(n \in \mathcal{P})^{m} \times \prod_i \Pr(n \in v_i(\mathcal{P}) \mid n \in \mathcal{P}). \qquad (3)$$

Therefore, in Eq. (3) we account for the probability $\Pr(n \in \mathcal{P})$ too many times, once for each view that participates in the rewriting, although we should account for it exactly once. By dividing Eq. (3) by $\Pr(n \in \mathcal{P})^{m-1}$, we can correct this overuse of n's appearance probability, obtaining the following $f_r$ formula when the views $v_1, \dots, v_m$ are pairwise c-independent:

$$f_r(n) = \Pr(n \in \mathcal{P}) \times \prod_i \Pr(n \in v_i(\mathcal{P}) \mid n \in \mathcal{P}) \qquad (4)$$
$$\phantom{f_r(n)} = \prod_i \Pr(n \in v_i(\mathcal{P})) \div \Pr(n \in \mathcal{P})^{m-1}. \qquad (5)$$

Each c-independent view $v_i$ gives us $\Pr(n \in v_i(\mathcal{P}) \mid n \in \mathcal{P})$, but there is still one missing ingredient in order to be able to compute $f_r$ as in Eq. (4): n's appearance probability value, $\Pr(n \in \mathcal{P})$.

We can prove the following:

LEMMA 3. $\Pr(n \in \mathcal{P})$ is computable from $\mathcal{P}_{v_1}, \dots, \mathcal{P}_{v_m}$ iff there exists one $v_i$ verifying $\mathrm{mb}(q) \sqsubseteq v_i$.

Using Lemma 3, we sum up the positive results of this section in 
the next theorem: 

THEOREM 3. Let q be a TP query and $v_1, \dots, v_m$ a set of pairwise c-independent TP views s.t. there exists a $v_i$ satisfying $\mathrm{mb}(q) \sqsubseteq v_i$. Let $q_r$ be a deterministic TP$^{\cap}$-rewriting of q, of the form

$$q_r = doc(v_1)/v_1 \cap \dots \cap doc(v_m)/v_m.$$

Then, ($q_r$, $f_r$) with $f_r$ as in Eq. (4), is a probabilistic TP$^{\cap}$-rewriting of q over V.

PROOF SKETCH. We refer the reader to [11], for the formal 
proof based on the material that precedes in this section. □ 

EXAMPLE 15. Consider the following (compensated) view for $v^{2}_{BON}$: v = comp(doc($v^{2}_{BON}$)/bonus, $q_{(3)}$). Clearly the views $v_{BON}$ and v are c-independent, and $q_{RBON} \equiv v_{BON} \cap v$. According to Theorem 3, the probability $\Pr(n_5 \in q_{RBON}(\mathcal{P}_{PER}))$ equals $0.75 \times 0.9 \div 1$ by the expression

$$\Pr(n_5 \in v_{BON}(\mathcal{P}_{PER})) \times \Pr(n_5 \in v(\mathcal{P}_{PER})) \div \Pr(n_5 \in \mathcal{P}_{PER}).$$

For other $n_i$, $\Pr(n_i \in q_{RBON}(\mathcal{P}_{PER})) = \Pr(n_i \in v_{BON}(\mathcal{P}_{PER})) = 0$.
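
Eq. (5) translates into a short function: multiply the per-view probabilities and divide by the appearance probability raised to the power m − 1. The sketch below assumes pairwise c-independent views and a known, positive appearance probability; the function name is ours, and the sample values are those of Example 15.

```python
from math import prod, isclose

def fr_pairwise_independent(view_probs, pr_appear):
    """Eq. (5): product of Pr(n in v_i(P)) divided by Pr(n in P)^(m-1).

    view_probs: the m values Pr(n in v_i(P)), one per view.
    pr_appear:  Pr(n in P), the appearance probability of n.
    """
    if pr_appear <= 0:
        raise ValueError("appearance probability must be positive")
    m = len(view_probs)
    return prod(view_probs) / pr_appear ** (m - 1)

# Example 15: two views with probabilities 0.75 and 0.9, appearance probability 1.
print(isclose(fr_pairwise_independent([0.75, 0.9], 1.0), 0.675))

# Consistency with Eq. (4) on hypothetical numbers: appearance 0.8 and
# conditional probabilities 0.5 and 0.25 give view probabilities 0.4 and 0.2.
print(isclose(fr_pairwise_independent([0.4, 0.2], 0.8), 0.8 * 0.5 * 0.25))
```

The second check illustrates why the division is needed: without it, the appearance probability 0.8 would be counted twice.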


We next show that, even for very limited (//-free) input queries and views V, it is hard to decide the existence of a subset of pairwise c-independent views from V on which a rewriting as in Theorem 3 can be built. This implies that, for extended skeletons as well, it is hard to find TP$^{\cap}$-rewritings by Theorem 3's approach.

THEOREM 4. Let TP query q and TP views V be without //-edges. Then, deciding whether there exists a TP$^{\cap}$-rewriting of q using only pairwise c-independent views from V is NP-hard.

PROOF SKETCH. We prove this by reduction from the problem of k-DIMENSIONAL PERFECT MATCHING: Given a k-hypergraph H = (U, E) with s = |U| and m = |E|, is there a subset $S \subseteq E$ of s/k hyperedges such that each vertex of U is contained in exactly one hyperedge of S?

Let $U = \{u_1, \dots, u_s\}$ and let $E = \{e_1, \dots, e_m\}$ denote the nodes and edges of the k-dimensional hypergraph H.

We build an input query q of the form q = a[1]/a[2]/.../a[s]//b. For each edge $e_i \in E$, we build a view $v_i$ as follows: a sequence of s a-labeled nodes separated by /-edges, followed by a //-edge and then a b-labeled node. On the a-nodes, the predicates corresponding to the vertices of $e_i$ are present. For example, for an edge $e_i$ = (1, 2, 3) (for k = 3) we would construct the following view:

$v_i$ = a[1]/a[2]/a[3]/a/.../a//b.

Now, if a perfect matching for H exists, one can notice that the views corresponding to the edges of this matching should be c-independent. Their intersection gives an equivalent rewriting both in the absence and in the presence of probabilities. Vice-versa, if we find a rewriting of q using views, this means that we have a set of views which are c-independent and cover all the predicates of q. But this amounts to finding a perfect matching for H. □
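
The construction used in the reduction can be sketched as follows. The helpers build the query and the views as XPath-like strings from a hypergraph whose vertices are numbered 1..s; the brute-force matching search is included only to make the correspondence concrete on small inputs and is, of course, exponential. All function names are ours, and treating "pairwise disjoint hyperedges" as the counterpart of pairwise c-independent views is the reading suggested by the proof sketch.

```python
from itertools import combinations

def build_query(s):
    """q = a[1]/a[2]/.../a[s]//b"""
    return "/".join(f"a[{i}]" for i in range(1, s + 1)) + "//b"

def build_view(s, edge):
    """View for one hyperedge: predicate [i] on the i-th a-node iff vertex i is in the edge."""
    steps = [f"a[{i}]" if i in edge else "a" for i in range(1, s + 1)]
    return "/".join(steps) + "//b"

def perfect_matching(s, edges, k):
    """Brute force: s/k pairwise disjoint hyperedges covering vertices 1..s, or None."""
    if s % k:
        return None
    for subset in combinations(edges, s // k):
        covered = set().union(*subset)
        # Full coverage with exactly s vertex occurrences means no overlaps.
        if len(covered) == s and sum(len(e) for e in subset) == s:
            return list(subset)
    return None

edges = [(1, 2, 3), (3, 4, 5), (4, 5, 6)]
print(build_query(6))
print(build_view(6, edges[0]))
print(perfect_matching(6, edges, 3))
```

In the example, the matching {(1, 2, 3), (4, 5, 6)} corresponds to two views whose predicates sit on disjoint sets of a-nodes and together cover all predicates of q.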

5.3 Using View Decompositions 

Theorem 4 shows that it may not be possible to find a TP$^{\cap}$-rewriting based on c-independent views efficiently. It may however be possible to build probabilistic rewritings without requiring pairwise c-independence. We describe in this section a general, sound and complete approach, which attempts to build a system of probability values from the views' results, even in the presence of dependent views. We first give the intuition behind it with an example.

EXAMPLE 16. Consider the input query q = a[1]/b[2]/c[3]/d, 
and the views v_1 = a[1]/b/c[3]/d, v_2 = a/b[2]/c[3]/d, 
v_3 = a[1]/b[2]/c/d, and v_4 = a//d. 

Note that the first three views are not c-independent with each 
other. Their intersection does yield a deterministic TP∩-rewriting 
of q (in fact, v_1 and v_2 would suffice for a deterministic rewriting 
of q). Moreover, v_4, which gives us the probability Pr(n ∈ P) 
for any node n in q's result, would be considered redundant in the 
deterministic setting (it contains the other three views). 

For a probabilistic rewriting, it remains to specify the probability 
function f_r. However, due to probabilistic dependencies between 
the views, it is no longer possible to simply multiply the individual 
probabilities from v_1, v_2, v_3 (v_4 gives the values Pr(n ∈ P)). 

However, one could retrieve the values Pr(n ∈ q(P)) by slightly 
more involved arithmetic manipulations, starting from the 
observation that these can be seen as the product of four independent 
probability values: Pr(n ∈ P) and one value for each of the three 
predicates. With a slight abuse of notation, for j ∈ {1, 2, 3} let 
Pr(j) denote the probability of the predicate [j] to match at the 
corresponding depth. The following system of equations can then be 
built straightforwardly, for each n such that Pr(n ∈ q(P)) > 0: 

$$
\begin{aligned}
\Pr(n \in v_1(P)) &= \Pr(n \in P) \times \Pr(1) \times \Pr(3), \\
\Pr(n \in v_2(P)) &= \Pr(n \in P) \times \Pr(2) \times \Pr(3), \\
\Pr(n \in v_3(P)) &= \Pr(n \in P) \times \Pr(1) \times \Pr(2), \\
\Pr(n \in v_4(P)) &= \Pr(n \in P),
\end{aligned}
$$

which would allow us to obtain easily the probability values 

$$\Pr(n \in q(P)) = \Pr(n \in P) \times \Pr(1) \times \Pr(2) \times \Pr(3),$$

provided the system allows a unique solution for the unknown 
value of Pr(n ∈ q(P)). This condition can be verified independently 
of any node n (i.e., it is not data dependent). 
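
For this particular example the arithmetic can be carried out explicitly: multiplying the equations for the first three views gives Pr(n ∈ P)^3 × (Pr(1) × Pr(2) × Pr(3))^2, and dividing by the value from the fourth view leaves the square of Pr(n ∈ q(P)). The Python sketch below checks this numerically. The function name and the numeric values are hypothetical and serve only to simulate view results; this closed form applies to Example 16, not to arbitrary views.

```python
import math


def recover_query_probability(pr_v1, pr_v2, pr_v3, pr_v4):
    """Pr(n in q(P)) from the four view probabilities of Example 16.

    The product pr_v1 * pr_v2 * pr_v3 equals
    Pr(n in P)^3 * (Pr(1) * Pr(2) * Pr(3))^2; dividing by
    pr_v4 = Pr(n in P) leaves the square of Pr(n in q(P)).
    """
    return math.sqrt(pr_v1 * pr_v2 * pr_v3 / pr_v4)


# Hypothetical ground-truth values, used only to simulate the view results.
pr_n, p1, p2, p3 = 0.8, 0.5, 0.4, 0.9
pr_v1 = pr_n * p1 * p3
pr_v2 = pr_n * p2 * p3
pr_v3 = pr_n * p1 * p2
pr_v4 = pr_n

recovered = recover_query_probability(pr_v1, pr_v2, pr_v3, pr_v4)
expected = pr_n * p1 * p2 * p3
```

Note that the value from v_4 is needed in this computation, even though v_4 is redundant for the deterministic rewriting.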



Outline of the general approach. The idea illustrated by 
Example 16 can be generalized into an algorithm that applies to 
any query and views, without being data dependent. In principle, 
we will still rely on probability terms that are independent, but at 
a more fine-grained level. More precisely, we will decompose the 
set of views V = {v_1, ..., v_m} into a set of pairwise c-independent 
view decompositions (in short d-views; these are queries from TP 
as well) denoted w_1, ..., w_s, and then use d-views instead of the 
given ones. The major difference w.r.t. the setting of Theorem 3 
is that now we will not have the explicit probabilities of d-views' 
results, but only some combinations thereof (from the results of 
the given views), in a non-homogeneous linear system. Based on 
this system, we will then describe a decision procedure (sound and 
complete) for the existence of the f_r function, running in PTime 
in the size of the query and of the initial set of views. 

We start by describing how we move from the initial set of views 
to the set of d-views w_1, ..., w_s, by the following four steps. 

We can see each view v_i as being of the form v_i = ft_i // m_i // lt_i, 
where ft_i is the first token, lt_i is the last token, and m_i denotes the 
rest (m_i may be empty; if m_i is empty, lt_i may also be empty). 

Step 1. For each v_i, build the TP queries w_i^1, w_i^2, ... as follows: 

(i) |mb(ft_i)| + |mb(lt_i)| queries: one query for each main branch 
node n of either ft_i or lt_i, obtained from v_i by removing all 
predicates except the ones on n. 

(ii) One query of the form mb(ft_i)//m_i//mb(lt_i), i.e., obtained 
from v_i by keeping only the predicates of the m_i part. 

Step 2. For each view v_i and its w_i^j queries obtained at the 
previous step, repeat until no change occurs the following: replace any 
two queries w_i^j, w_i^l that are not c-independent by their 
intersection w_i^j ∩ w_i^l. 
Step 3. Replace each query obtained at the previous step by its 
intersection with the linear query mb(q). 

Step 4. Across the m sets of queries obtained at the previous step, 
group the queries into equivalence classes (by query equivalence). 
Then, by introducing one d-view name for each equivalence class, 
output the final set of d-views {w_1, ..., w_s}. 
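
The four steps can be illustrated on a deliberately simplified fragment. The Python sketch below is not the general procedure: it assumes that every view, after Step 3, has the same main branch as q, that predicates sit on main branch nodes whose depth is unambiguous, and that predicates at different depths are c-independent, so that Step 2 merges nothing. Query equivalence in Step 4 is approximated by equality of (depth, predicate set) pairs. The function name and the data representation are hypothetical.

```python
def decompose(views):
    """Map each view to its d-views in a simplified fragment.

    `views` maps a view name to a list of predicate sets, one per
    main-branch depth (an empty set means no predicate at that depth).
    Assumption: predicates at different depths are c-independent, so
    Step 2 merges nothing, and Step 3 is already reflected in the input.
    """
    classes = {}  # (depth, predicates) -> d-view name; stands in for Step 4
    decomposition = {}
    for name, nodes in views.items():
        parts = set()
        for depth, preds in enumerate(nodes, start=1):
            if not preds:
                continue  # no predicate to account for at this node
            key = (depth, frozenset(preds))
            if key not in classes:
                classes[key] = f"w{len(classes) + 1}"
            parts.add(classes[key])
        decomposition[name] = parts
    return classes, decomposition


# Views of Example 16 along the main branch a/b/c/d.
example_views = {
    "v1": [{"1"}, set(), {"3"}, set()],
    "v2": [set(), {"2"}, {"3"}, set()],
    "v3": [{"1"}, {"2"}, set(), set()],
    "v4": [set(), set(), set(), set()],  # a//d after intersection with mb(q)
}
classes, decomposition = decompose(example_views)
```

On Example 16 this yields three d-views, one per predicate, shared across the views that test that predicate; v4 contributes no d-view of its own.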

For each of the initial views v_i, let W_i ⊆ {w_1, ..., w_s} denote 
the d-views into which it is decomposed. We can now write the 
following equation for each view v_i: 

$$\Pr(n \in v_i(P)) = \Pr(n \in P) \times \prod_{w_j \in W_i} \Pr(n \in w_j(P) \mid n \in P) \qquad (6)$$

For the input query q, let W_q ⊆ {w_1, ..., w_s} denote the d-views 
into which q can be decomposed. We have an additional equation: 

$$\Pr(n \in q(P)) = \Pr(n \in P) \times \prod_{w_j \in W_q} \Pr(n \in w_j(P) \mid n \in P). \qquad (7)$$

Let S(q, V) denote the non-homogeneous system of m + 1 linear 
equations that can be obtained from Eq. (6) and Eq. (7), by taking 
the logarithm. Note that S(q, V) has s + 2 variables: s variables 
corresponding to the Pr(n ∈ w_j(P) | n ∈ P) terms, one variable for 
Pr(n ∈ P), and one distinguished variable for Pr(n ∈ q(P)). 
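
As a rough illustration of how the system looks once logarithms are taken, the Python sketch below builds one coefficient row per view from Eq. (6) and one row for Eq. (7). The representation is an assumption made for this sketch: the variables are ordered as log Pr(n ∈ P) followed by one log-probability per d-view, and the distinguished variable for Pr(n ∈ q(P)) is kept outside the matrix, expressed through its own coefficient row. The function names and the probability values are hypothetical.

```python
import math


def build_system(decomposition, view_probs, dviews):
    """Log-linear equations from Eq. (6): one row per view.

    Variables are ordered as [log Pr(n in P), log of the term for w_1, ...].
    Returns (rows, rhs) such that rows[i] . x = rhs[i].
    """
    rows, rhs = [], []
    for name, parts in decomposition.items():
        rows.append([1] + [1 if w in parts else 0 for w in dviews])
        rhs.append(math.log(view_probs[name]))
    return rows, rhs


def query_row(query_parts, dviews):
    """Coefficients of Eq. (7): log Pr(n in q(P)) as a sum of the variables."""
    return [1] + [1 if w in query_parts else 0 for w in dviews]


# Example 16, with d-views w1, w2, w3 standing for predicates [1], [3], [2].
dviews = ["w1", "w2", "w3"]
decomposition = {
    "v1": {"w1", "w2"},
    "v2": {"w2", "w3"},
    "v3": {"w1", "w3"},
    "v4": set(),
}
# Hypothetical view probabilities for one node n (all strictly positive).
view_probs = {"v1": 0.36, "v2": 0.288, "v3": 0.16, "v4": 0.8}

rows, rhs = build_system(decomposition, view_probs, dviews)
q_row = query_row({"w1", "w2", "w3"}, dviews)
```

Only the right-hand side depends on the node n; the coefficient rows depend on the query and views alone, which is why solvability can be tested independently of the data.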

We are now ready to present our main results for probabilistic 
TP∩-rewritings: 

THEOREM 5. Let q be a TP query, V = {v_1, ..., v_m} be a set 
of TP views containing q. Let q_r be a deterministic TP∩-rewriting 
of q, of the form q_r = doc(v_1)/v_1 ∩ ... ∩ doc(v_m)/v_m. There exists 
a probabilistic TP∩-rewriting, of the form (q_r, f_r), if S(q, V) 
admits a unique solution for Pr(n ∈ q(P)). Unless mb(q) has only 
/-edges, such a probabilistic TP∩-rewriting exists only if S(q, V) 
admits a unique solution for Pr(n ∈ q(P)). 

PROOF SKETCH. For each of the views participating in q_r, the 
decomposition into a product of independent terms by Eq. (6) is 
sound. Moreover, it is the maximal one (w.r.t. number of terms) or, 
put otherwise, the most fine-grained, since dependencies between 
tests of the views must occur within the same d-view. Any other 
reformulation of the view probabilities is necessarily either subsumed 
by S(q, V) or equivalent to it modulo renaming of variables. 

In particular, predicates of the same node must be part of the 
same d-view (the probabilities for them to match are obviously not 
independent). Note that we can refer to predicates of the first or last 
tokens of views - and their probability to match - unambiguously, 
since the main branch nodes of these tokens are unambiguously 
identified on the path from the root of the p-document to some 
result node n selected by q. But this is not the case for predicates of 
the m_i parts of views, which is the reason we need to consider them 
"in bulk", by a single w_i^j expression corresponding to all of them. 

At Step 3 (intersection with mb(q)), we simply make explicit the fact 
that the nodes n we are interested in must be found on the path 
matching mb(q), as they verify Pr(n ∈ q(P)) > 0. Omitting this step 
would keep the approach sound, but may cause us to miss 
opportunities to reuse the same variable across distinct views (and 
ultimately to find the f_r function). (For a detailed proof, see [11].) □ 

INPUT : TP query q and views V 
OUTPUT: canonical rewriting q_r 

R := ∅, V' := V, V'' := V 

Prefs := {(v_i, a) | q(a) ⊑ v_i, for q(a) being the prefix 
          of size a of q}; 

for each v ∈ V do 
    k := |mb(v)|, 
    t := last token of v, 
    u := size of maximal prefix-suffix in t 

    for each (v, a) ∈ Prefs do 
        V' := V' ∪ {comp(v, q(a))} 
        v' := v w/o predicates at node of depth k (out(v)) 
        q'' := comp(mb(v), q(a)) 
        if v' and q'' are not c-independent then continue; 
        if comp(doc(v)/lbl(v), q(a)) is restricted then 
            V'' := V'' ∪ {comp(v, q(a))} 
        else if no predicates on the first u - 1 nodes in t then 
            V'' := V'' ∪ {comp(v, q(a))} 

q_r := ∩_{v_i ∈ V'} doc(v_i)/v_i 

if unfold(q_r) ≡ q then 
    if S(q, V'') has a unique solution for Pr(n ∈ q(P)) then 
        return true 

Figure 7: TPIrewrite for probabilistic TP∩-rewritings 

PROPOSITION 5. Let q be a TP query and V = {v_1, ..., v_m} 
be a set of TP views s.t. there exists a deterministic TP∩-rewriting 
of q, of the form q_r = doc(v_1)/v_1 ∩ ... ∩ doc(v_m)/v_m. Testing 
whether the system S(q, V) admits a unique solution for 
Pr(n ∈ q(P)) can be done in PTime, modulo TP∩-equivalence tests. 

PROOF SKETCH. Finding a deterministic rewriting requires an 
equivalence test of the kind described in Section 5.1, hence a worst- 
case exponential step. 

Regarding the f_r component for a probabilistic rewriting, in the 
S(q, V) construction, by the way intersections of patterns are 
constructed at Step 2, these TP∩ queries reduce trivially to equivalent 
TP ones (they are union-free, in the terminology of [8]). So we can 
safely assume that each run of Step 2 deals only with tree patterns, 
instead of intersections thereof. At Step 4, the d-views are obtained 
based on equivalence tests, which may involve TP∩ queries. 

Then, testing if S(q, V) admits a unique solution for 
Pr(n ∈ q(P)) can be done in polynomial time by straightforward linear 
algebra manipulations. Note that this does not necessarily mean 
that S(q, V) admits a unique solution for all its variables. □ 
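
One way to carry out this test, sketched below in Python, relies on a standard linear algebra fact: in a consistent linear system, a linear combination of the variables takes the same value in every solution exactly when its coefficient row lies in the row space of the system. The sketch treats the log-probabilities as unconstrained real variables, which is an assumption of this illustration, and uses exact rational arithmetic. The function names are hypothetical, and the rows are those of Example 16 with variables ordered as log Pr(n ∈ P) followed by one variable per predicate.

```python
from fractions import Fraction


def rank(matrix):
    """Rank by Gaussian elimination over exact rationals."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    r = 0  # number of pivots found so far
    for col in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def query_is_determined(view_rows, q_row):
    """True iff q_row is a linear combination of the view rows."""
    return rank(view_rows + [q_row]) == rank(view_rows)


# Example 16: columns are [Pr(n in P), predicate 1, predicate 2, predicate 3].
v1, v2, v3, v4 = [1, 1, 0, 1], [1, 0, 1, 1], [1, 1, 1, 0], [1, 0, 0, 0]
q = [1, 1, 1, 1]

with_all_views = query_is_determined([v1, v2, v3, v4], q)
without_v4 = query_is_determined([v1, v2, v3], q)
only_v1_v2 = query_is_determined([v1, v2], q)
```

The check uses only the coefficient rows, so it does not depend on any node n. In this example the query probability is determined when all four views are used, but not when v_4 is dropped, nor when only v_1 and v_2 are kept.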

5.4 Dealing with Compensated Views 

We consider in this section general TP∩-rewritings that, before 
performing the intersection step, might compensate (some of) the 
views. We show that rewriting in this new setting can be reduced 
to the one discussed in the previous section, by relying also on the 
results of Section 4. This allows us to reuse the same PTime algo- 
rithm and to find strictly more rewritings, namely those that would 
not be feasible without compensation. 

The general approach will be the following. Starting from the 
given set of views V, all containing the input query q or a prefix 
thereof, we will expand V into V' by adding to it all possible 
compensated views of the form comp(v, q(a)), for v ∈ V and a being 
a depth in the range 1 to |mb(q)|. As in [8], the views of V' will 
then be used to build what we call the canonical deterministic plan 
q_r = ∩_{v_i ∈ V'} doc(v_i)/v_i. For a probabilistic rewriting, among the 
views of V', we select the subset V'' of those that (i) either were 
originally in V, or (ii) verify certain conditions that ensure their 
result probabilities can indeed be computed from the initial results 
of the view they were constructed on (using the decision procedure 
described in Section 4). 

Algorithm TPIrewrite (see Figure 7) details this approach, and it 
represents a decision procedure for finding TP∩-rewritings based 
on possibly compensated views. It takes as input a TP query q and 
a set of TP views V and returns the canonical deterministic 
TP∩-rewriting q_r, whenever the f_r function can also be built. 

We can prove the following main result for TP∩-rewritings: 

PROPOSITION 6. TPIrewrite is sound for testing if a probabilistic 
TP∩-rewriting of a query q over views V exists. It is also 
complete, unless mb(q) has only /-edges. 

Modulo TP∩-equivalence tests, TPIrewrite runs in PTime in the 
size of the query and views. 

Potentially expensive equivalence tests may be performed when 
we verify whether q_r is a deterministic rewriting (step before last 
in TPIrewrite) or in the construction of the S(q, V) system. But 
these can be performed efficiently when we deal with the restricted 
fragment of extended skeletons. 

COROLLARY 3 (OF PROP. 6). If the views V and query q are 
extended skeletons, TPIrewrite runs in PTime in the size of the 
query and views. 

Remark. Note that the evaluation of the rewriting on the view 
extensions might require first the evaluation of any compensated 
view used in q_r and f_r, hence may require exponential time in the 
size of the query and views, which is not surprising given that the 
evaluation of TP queries over probabilistic XML is intractable. 

6. OTHER RELATED WORK 

There is a rich literature on query rewriting using views for de- 
terministic XML data. XPath rewriting using only one view [36, 
25, 37] or multiple views [6, 34, 5, 8, 26] was the topic of several 
studies. They differ in the completeness guarantees they provide, 
or the assumptions they rely on. 

Some join-based rewriting methods either give no completeness 
guarantees [6, 34] or can do so only if there is extra knowledge 
about the structure and nesting depth of the XML document [5]. 
Others can only be used if the node Ids are in a special encoding 
that accounts for structural information [34]. Rewriting more ex- 
pressive XML queries using views was studied in [14, 18, 29]. 

[17] studies query answering using views for relational 
probabilistic data. There is little work dealing with the optimization of 
query answering for probabilistic XML. A system that uses 
relational probabilistic databases for storing and managing probabilistic 
XML is proposed in [20]. Approximate computation of 
probabilities for tree-pattern queries over probabilistic XML is studied 
in [22, 33]. We presented some preliminary results on rewriting for 
probabilistic XML at a workshop without formal proceedings [12]. 

7. CONCLUSION AND FUTURE WORK 

Our work is the first to address the problem of answering queries 
using views over probabilistic XML. Since in a probabilistic set- 
ting queries return answers with probabilities, view-based rewriting 
goes beyond the classic problem of retrieving answers from XML 
views. Thus, the new challenge raised in the probabilistic setting 
is to find probability-retrieving functions that can access only view 
results, while being able to compute the probabilities of answers. 

We identified large classes of XPath queries - with child and de- 
scendant navigation and predicates - for which there are efficient 
(PTime) algorithms, considering the rewriting problem under the 
two possible semantics for XML query results: with persistent node 
identifiers and in their absence. Accordingly, we considered rewrit- 
ings that can exploit a single view, by means of compensation, and 
rewritings that can use multiple views, by means of intersection. 
Recall that (direct) query answering for the probabilistic XML model 
considered here is also polynomial in data and intractable in query 
complexity [22]. 

All our results are practically interesting, as they allow expres- 
sive queries and views, with descendant navigation and path filter 
predicates, and our decision procedures are based on easily 
verifiable criteria on the query and views. For both semantics, the 
evaluation of an alternative plan is no more expensive than query 
evaluation over probabilistic XML. Moreover, rewritings based solely on 
intersection would require only the computation of the f_r function, 
either by a product formula or by solving an S(q, V) system, 
operations that should cost significantly less than the dynamic 
programming approach for query evaluation over the original data [22]. 
Even for plans that do use compensation (which may require 
TP-evaluation over view extensions), the costs should be reduced in 
practice, especially if extensions are much smaller than the original 
p-document. We intend to evaluate the impact of these techniques 
in practice, in a probabilistic XML management system. Also, 
heuristics for choosing the views that participate in a rewriting, tai- 
lored to the setting of probabilistic XML data, may represent valu- 
able optimizations. Beyond the natural choice of just caching the 
probabilistic results, keeping and exploiting for rewritings a sort of 
why-provenance of probability values is also an interesting direc- 
tion for future research. 

Another possible direction for future work is to broaden the set- 
ting to other models for probabilistic XML data. The p-documents 
studied in this paper have local probabilistic dependences, while 
there are models allowing for more complex probabilistic interac- 
tions between remote fragments of data [32]. For these types of 
data, query answering is intractable (already in data complexity) 
and it would be interesting to see under which conditions we can 
gain tractability by relying on views. 